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Abstract 

In  regression  analysis  the  response  variable  Y  and  the  predictor 
variables  are  often  replaced  by  functions  9(Y)  and 

<t>1  (X-j  (X  ).  We  discuss  a  procedure  for  estimating  those  functions 

0*  and  4>*,...,4>*  that  minimize 


e 


2 


E([0(Y)  -  l  <MX.)]2} 

_ j=l  J  J 

Var[9(Y)] 


given  only  a  sample  { (yjc»xkl  »•  •  •  1  <_k  <^N>  and  making  minimal 

assumptions  concerning  the  data  distribution  or  the  form  of  the  solution 

functions.  For  the  bivariate  case,  p=l,  0*  and  <j>*  satisfy 

p*  =  p(e*,<{>*)  =  max  p[0(Y) ,4>(X)]  where  p  is  the  product  moment  corre- 

lation  coefficient  and  p*  is  the  maximal  correlation  between  X  and 

Y.  Our  procedure  thus  also  provides  a  method  for  estimating  the  maximal 

correlation  between  two  variables. 

★  ”  r~  ~ 
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1 .  Introduction 

Nonlinear  transformation  of  variables  is  a  commonly  used  practice 
in  regression  problems.  Two  common  goals  are  stabilization  of  error 
variance  and  symmetrization/ normalization  of  error  distribution.  A  more 
comprehensive  goal,  and  the  one  we  adopt,  is  to  find  those  transformations 
that  produce  the  best  fitting  additive  model.  Knowledge  of  such  trans¬ 
formations  aid  in  the  interpretation  and  understanding  of  the  relation¬ 
ship  between  the  response  and  predictors. 

Let  Y,X-| , . . .  ,Xp  be  random  variables  with  Y  the  response  and 
Xr...,Xp  the  predictors.  Let  6  (Y)  ,<j>1  (X1 ) ,. . .  ,<fr  (X  )  be  arbitrary 
measurable  functions  of  the  corresponding  random  variables.  The  fraction 

of  variance  not  explained  (e2)  by  a  regression  of  8(Y)  on  f  <j>.(X.) 

i=l  1  1 

is 

E{[e(Y)  -  j  4>.(X.)]2} 

O-l)  e2(e,^,...,0p)  =  VarL0(Y)] 

Then  define  optimal  transformations  as  functions  . . .  ,<|>*  that 

minimize  (1.1):  i.e. 

(1.2)  e  (0  » <}>-,  »•••»<(>  )  =  min  e  ,. . .  ,<p  )  . 

P  0*^i  »••  •  »4>p  P 

We  show  in  Section  5  that  optimal  transformations  exist  and  satisfy 
a  complex  system  of  integral  equations.  The  heart  of  our  approach  is 
that  there  is  a  simple  iterative  algorithm  using  only  bivariate  condi¬ 
tional  expectations  which  converges  to  an  optimal  solution.  When  the 
conditional  expectations  are  estimated  from  a  finite  data  set,  then  use 
of  the  algorithm  results  in  estimates  of  the  optimal  transformations. 
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Thi s  method  has  some  powerful  characteristics.  It  can  be  applied  in 
situations  where  the  response  and/or  the  predictors  involve  arbitrary 
mixtures  of  continuous  ordered  variables  and  categorical  variables  (ordered 

or  unordered).  The  functions  6,4>j . $  are  real  valued.  If  the 

original  variable  is  categorical,  the  application  of  6  or  4^.  assigns 
a  real  valued  score  to  each  of  its  categorical  values. 

The  procedure  is  non parametric.  The  optimal  transformation  estimates 

are  based  solely  on  the  data  sample  {(yk»xkl  . xkp),  ^  -k  with 

minimal  assumptions  concerning  the  data  distribution  and  the  form  of  the 
optimal  transformations.  The  principal  distributional  assumption  is  that 
the  data  are  i.i.d.  In  particular,  we  do  not  require  the  transformation 
functions  to  be  from  a  particular  parameterized  family  or  even  monotone. 

(We  illustrate  below  situations  where  the  optimal  transformations  are  not 
monotone. ) 

For  the  bivariate  case,  p=l,  the  optimal  transformations  e*(Y), 
4>*(X)  satisfy 

(1.3)  p*(X,Y)  =  p(6*,4>*)  =  max  p[e(Y),4»(X)] 

e,4> 

where  p  is  the  product  moment  correlation  coefficient.  The  quantity 
P*(X,Y)  is  known  as  the  maximal  correlation  between  X  and  Y,  and  is 
used  as  a  general  measure  of  dependence  (Gebelern  [1947];  see  also  Renyi 
[1959]  and  Sarmanov  [1958A.B]).  The  maximal  correlation  has  the  follow¬ 
ing  properties  (Renyi  [1959]): 

(a)  0  <_  p*(X,Y)  <  1 

(b)  p*(X,Y)  =  0  if  and  only  if  X  and  Y  are  independent 
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(c)  if  there  exists  a  relation  of  the  form  u ( X )  =  v(Y)  where  u 
and  v  are  Borel -measurable  functions  with  var[u(X)]  >  0, 
then  p*(X,Y)  =  1 . 

Therefore,  in  the  bivariate  case  our  procedure  can  also  be  regarded  as  a 
method  for  estimating  the  maximal  correlation  between  two  variables, 
providing  as  a  by-product  estimates  of  the  functions  6*,  <t>*  that  achieve 
the  maximum. 

In  the  next  section,  we  describe  our  procedure  for  finding  optimal 
transformations  using  algorithmic  notation,  deferring  mathematical 
justifications  to  sections  5  and  6.  We  next  illustrate  the  procedure  in 
Section  3  by  applying  it  to  a  number  of  simulated  data  sets  where  the 
optimal  transformations  are  known.  The  estimates  are  surprisingly  good. 

Our  algorithm  is  also  applied  to  the  Boston  housing  data  of  Harrison  and 
Rubinfeld  [1978]  as  listed  in  Belsey,  Kuh  and  Welsch  [1980].  The  trans¬ 
formations  found  by  the  algorithm  generally  differ  from  those  applied  in 
the  original  analysis.  (A  FORTRAN  implementation  of  our  algorithm  is 
listed  in  Appendix  2).  Section  4  presents  a  general  discussion  and  relates 
this  procedure  to  other  empirical  methods  for  finding  transformations. 

Sections  5,  6,  and  Appendix  1  provide  some  theoretical  framework  for 
the  algorithm.  In  Section  5,  under  weak  conditions  on  the  joint  distribu¬ 
tion  of  Y,Xj,...,X  ,  it  is  shown  that  optimal  transformations  exist  and 
are  generally  unique  up  to  a  change  of  sign.  The  optimal  transformations 
are  characterized  as  the  eigenfunctions  of  a  set  of  linear  integral  equa¬ 
tions  whose  kernels  involve  bivariate  distributions.  We  then  show  that 
our  procedure  converges  to  optimal  transformations. 

Section  6  discusses  the  algorithm  as  applied  to  finite  data  sets. 

The  results  are  dependent  on  the  type  of  data  smooth  employed  to  estimate 
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the  bivariate  conditional  expectations.  Convergence  of  the  algorithm  is 
proven  only  for  a  very  restricted  class  of  data  smooths.  However,  in  over 
a  thousand  applications  of  the  algorithm  on  a  variety  of  data  sets  using 
three  different  types  of  data  smoothers  only  one  (very  contrived)  instance 
of  non  convergence  has  been  found. 

Section  6  also  contains  proof  of  a  consistency  result.  Under  fairly 
general  conditions,  as  the  sample  size  increases  the  finite  data  trans¬ 
formations  converge  in  a  "weak"  sense  to  the  distributional  space  optimal 
transformations.  Finally,  Appendix  1  contains  a  brief  discussion  of  the 
needed  consistency  properties  of  bivariate  smooths. 

This  paper  is  laid  out  in  two  distinct  parts.  Sections  1-4  give  a 
fairly  non-technical  overview  of  the  method  and  discus:  its  application  to 
data.  Sections  5  and  6  are,  of  necessity,  more  technical,  presenting  the 
theoretical  foundation  for  the  procedure. 
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2.  The  Algorithm 

Our  procedure  for  finding  0*,<J>*, . . .  ,4>*  is  iterative.  Assume  a  known 
distribution  for  the  variables  Y,X^,...,X  .  Without  loss  of  generality, 
let  var[e(Y)]  =  1,  and  assume  that  all  functions  have  expectation  zero. 

To  illustrate,  we  first  look  at  the  bivariate  case; 

(2.1)  e2(e,<p)  =  E[e(Y)-4>(X)]2 

Consider  the  minimization  of  (2.1)  with  respect  to  9(Y)  for  a  given 
function  4>(X).  The  solution  is 

(2.2)  9 1  (Y)  =  E[<j>(X)  1  Y]/llE[<j>(X)  |  Y]ll 
2  1/2 

with  B*H  i  [E(-)  ]  '  .  Next,  consider  the  minimization  of  (2.1)  with 
respect  to  4>(X)  for  a  given  0(Y).  The  solution  is 

(2.3)  <0*  (X)  -  E [ 0 ( Y )  |  X]  . 

Equations  (2.2)  and  (2.3)  form  the  basis  of  an  iterative  optimization 
procedure  involving  alternating  conditional  expectations  (ACE): 

BASIC  ACE  ALGORITHM 
set  9(Y)  =  Y/IYII ; 

ITERATE  UNTIL  e2(0,<j>)  fails  to  decrease: 

4> *  ( X )  =  E[9(Y)  j  X]; 
replace  <j>(X)  with  <f>'(X); 

0'(Y)  =  EC<t>(X )  ( Y]/ II £[<b( X )  |  Y]I; 
replace  e(Y)  with  p' (Y); 

END  ITERATION  LOOP; 

0  and  <|>  are  the  solutions  0*  and  <£*; 

END  ALGORTHM; 
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This  algorithm  decreases  (2.1)  at  each  step  by  alternatingly  mini¬ 
mizing  with  respect  to  one  function  holding  the  other  fixed  at  its 
previous  evaluation.  Each  iteration  (execution  of  the  iteration  loop) 
performs  one  pair  of  these  single  function  minimizations.  The  process 
begins  with  an  initial  guess  for  one  of  the  functions  (6  =  Y/IIYH  above) 
and  ends  when  a  complete  iteration  pass  fails  to  decrease  e  (2.1).  In 
Section  5,  we  prove  that  the  algorithm  converges  to  optimal  transformations 
0*.  <fr*. 

Now  consider  the  more  general  case  of  multiple  predictors  X^,...,X  . 
We  proceed  in  direct  analogy  with  the  basic  ACE  algorithm;  we  minimize 

(2.4)  e2(0, <t  )  =  E[6(Y)  -  £  4>.(X.)]2  , 

i  P  j=l  J  J 

2 

holding  E0  =1,  E9  5  ^  =  E4>p  =  0,  through  a  series  of  single 

function  minimizations  involving  bivariite  conditional  expectations.  For 
a  given  set  of  functions  4>i  (X^ ) ,. .  .,4>p(Xp),  minimization  of  (2.4)  with 
respect  to  e(Y)  yields 

(2.5)  0 '  (Y )  -  E[  l  <)>•  (X. )  |Y]/IIE[  £  *.  (X. )  |  Y]ll 

i=l  11  i=l  1  1 

The  next  step  is  to  minimize  (2.4)  with  respect  to  <f>-|  (X-| ),..., 6  (Xp) 
given  0(Y).  This  is  obtained  through  another  iterative  algorithm. 

Consider  the  minimization  of  (2.4)  with  respect  to  a  single  function 
<f>k(xk)  forgiven  e(Y)  and  a  given  set  <J>-, , . . .  ,4>k_i  ,4>k+1 ,. . .  ,<pp.  The 
solution  is 

(2.6)  $'(Xk)  =  E[0(Y)  -  .Jk*i(Xi)lxk]  • 

The  corresponding  iterative  algorithm  is  then: 
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set  <(>1(X1 <t>p(Xp)  =  0; 

ITERATE  UNTIL  e^e,^  , . . .  ,4^)  fails  to  decrease; 

FOR  k  =1  TO  p  DO; 

^(Xk)  »  E[9(Y)  -  l  <t>i(X.)|Xk]; 

replace  $k(X.  )  with  <^(X.  ); 

END  FOR  LOOP; 

END  ITERATION  LOOP; 

4>-j , . . .  ,4>p  are  the  solution  functions; 

2 

Each  iteration  of  the  inner  FOR  loop  minimizes  e  (2.4)  with  respect  to 
the  function  4>k(Xk),  k=l,...,p  with  all  other  functions  fixed  at 
their  previous  evaluations  (execution  of  the  FOR  loop).  The  outer  loop 
is  iterated  until  one  complete  pass  over  the  predictor  variables  (inner 

o 

FOR  loop)  fails  to  decrease  e  (2.4). 

Substituting  this  procedure  for  the  corresponding  single  function 

optimization  in  the  bivariate  ACE  algorithm  gives  rise  to  the  full  ACE 

2 

algorithm  for  minimizing  e  (2.4): 

ACE  ALGORITHM: 

set  e(Y)  =  Y/li Yfl  and  4>,  (X. ),.  ..,*n(Xn)  =  0; 

o  11  P  P 

ITERATE  UNTIL  e  (0,<j>^ , . . .  ,<pp )  fails  to  decrease; 

ITERATE  UNTIL  e^(0 ,4>^ ..  »4>p)  fails  to  decrease; 

FOR  k  =1  TO  p  DO: 

*U\)  =  E[0(Y)  -  y  ❖1*(Xi) Jxk]; 

replace  4>t(Xk)  with  <j>MX.  ); 

END  FOR  LOOP; 

END  INNER  ITERATION  LOOP; 

0'(Y)  =  E[  \  4>i (X. )  |  Y]/IE[  \  4>i(X.)tY]fl 
i=l  1  1  i=l  1  1 

replace  9(Y)  with  0 ’ ( Y) ; 

END  OUTER  ITERATION  LOOP; 

6,<t>j . .  ,4>p  are  the  solutions  0*,^, . . . 

END  ACE  ALGORTHM; 


-9- 


In  Section  5,  we  prove  that  the  ACE  algorithm  converges  to  optimal  trans¬ 
formations. 


1 
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3.  Applications 

In  the  previous  section,  the  ACE  algorithm  was  developed  in  the  con¬ 
text  of  known  distributions.  In  practice,  data  distributions  are  seldom 

known.  Instead,  one  has  a  data  set  {(yk,xkl . xkp),  1  that  is 

presumed  to  be  a  sample  from  Y,X^,...,X  .  The  goal  is  to  estimate  the 
optimal  transformation  functions  9(Y), (X  )  from  the  data. 

This  can  be  accomplished  by  applying  the  ACE  algorithm  to  the  data  with 
2 

the  quantity  e  ,  II  I ,  and  the  conditional  expectations  replaced  by 

suitable  estimates.  The  resulting  functions  §*,$*,...,$*  are  then 

taken  as  estimates  of  the  corresponding  optimal  transformations. 

2 

The  estimate  for  e  is  the  usual  mean  squared  error  for  regression, 
e  (e,<j>-| ... . ,<tp)  =  I  £0(yk)  ■ 

p 

If  g(y,x-| , . . .  ,xp)  is  a  function  defined  for  all  data  values,  then  llgll 
is  replaced  by 

89  =  N  k^9  ^k’xk1  ’ ' '  ‘  ,xkp^  • 

For  the  case  of  categorical  variables,  the  conditional  expectation  estimates 
are  straightforward: 

E[A|Z.z]  =  I  A .  /  J  1 

zrz  zrz 

where  A  is  a  real  valued  quantity  and  the  sum;,  are  over  the  subset  of 
observations  having  (categorical)  value  Z-z.  For  variables  that  can 
assume  many  ordered  values,  the  estimation  is  based  on  smoothing  techniques. 
Such  procedures  have  been  the  subject  of  considerable  study  (see,  for 
example,  Gasser  and  Rosenblatt  [1979],  Cleveland  [1979],  Craven  and 
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Wahba  [1979]).  Since  the  smoother  is  repeatedly  applied  in  the  algorithm, 
high  speed  is  desirable,  as  well  as  adaptability  to  local  curvature.  We 
use  a  smoother  employing  local  linear  fits  with  varying  window  width 
determined  by  local  cross-validation  ("super  smoother",  Friedman  and 


Stuetzle  [1982]). 

The  algorithm  evaluates  §*,$*,...,$*  at  all  the  corresponding  data 
values,  i.e.  §*(y)  is  evaluated  at  the  set  of  data  values  {y.}, 
k=l,...,N.  The  simplest  way  to  understand  the  shape  of  the  transforma¬ 
tions  is  by  means  of  a  plot  of  the  function  versus  the  corresponding  data 
values,  that  is,  through  the  plots  of  9*(yk)  versus  yk  and 
versus  the  data  values  of  x.j,...,x  respectively. 

In  this  section,  we  illustrate  the  ACE  procedure  by  applying  it  to 
various  data  sets.  (A  FORTRAN  subroutine  implementing  our  procedure  is 
listed  in  Appendix  2.)  In  order  to  evaluate  performance  on  finite  samples, 
the  procedure  is  first  applied  to  simulated  data  for  which  the  optimal 
transformations  are  known.  We  then  apply  it  to  the  Boston  housing  data  of 
Harrison  and  Rubinfeld  [1978]  as  listed  in  Belsey,  Kuh  and  Welsch  [1980], 
contrasting  the  ACE  transformations  with  those  used  in  the  original 
analysis. 

Our  first  example  consists  of  200  bivariate  observations  Hyk>xk)» 

1  _<^k  <_200>  generated  from  the  model 


yk  =  exp[sin(xk)  +  ek/2] 

with  the  xk  sampled  from  a  uniform  distribution  U(0,2tt)  and  the  ek 
drawn  independently  of  the  xk  from  a  standard  normal  distribution 
N(0,1).  Figure  la  shows  a  scatterplot  of  these  data.  Figures  lb-ld  show 
the  results  of  applying  the  ACE  algorithm  to  the  data.  The  estimated 
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optimal  transformation  e*(y)  is  shown  in  the  plot  lb  of  § *(y k)  versus 
yk,  1  £  k  £  200.  Figure  1c  Is  a  plot  of  $*(x^)  versus  x^.  These 
plots  clearly  suggest  the  transformati  ns  9(y)  =  log(y)  and  4>(x)  = 
sin(x)  which  are  optimal  for  the  parent  distribution.  Figure  Id  is  a 
plot  of  9*(yk)  versus  $*(xk).  This  plot  indicates  a  more  linear  rela¬ 
tion  between  the  transformed  variables  than  that  between  the  untransformed 
ones. 

The  next  issue  we  address  is  how  much  the  algorithm  overfits  the 

data  due  to  the  repeated  smoothings,  resulting  in  inflated  estimates  of 

2  2 

the  maximal  correlation  p*  and  of  R*  =  1  -  e*  .  The  answer,  on  the  simulated 
data  sets  we  have  generated,  is  surprizing  little. 

o 

To  illustrate  this,  we  contrast  two  estimates  of  p*  and  R* 
using  the  above  model.  The  known  optimal  transformations  are  e(Y)  = 
log  Y,  <j>(X)  =  sin  X.  Therefore,  we  define  the  direct  estimate  for  p* 
given  any  data  set  generated  as  above  by 


P 


1 

N 


N  _  _ 

l  (log  yt  -log  y)(sin  x.  -sin  x) 
k=l  K  K 


and 


The  ACE  algorithm  produces  the  estimates 


2*  •  i 


and  R*2  =  1  -e*2  =  p*2.  In  this  model  p*  =  .8165  and  R*2  =  .6667. 

For  100  data  sets,  each  of  size  200,  generated  from  the  above  model, 
the  means  and  standard  deviations  of  the  p*  estimates  are 


mean  s.d. 


p*  direct 
ACE 


.814 

.808 


.022 

.031 
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2 

The  means  and  standard  deviations  of  the  R  estimates  are 


mean 

s.d. 

R*2  direct 

.664 

.031 

ACE 

.654 

.050 

We  also  computed  the  differences  p*-p*  and  ft*2-^*2  for  the  100 
data  sets.  The  means  and  standard  deviations  are 


P*  -  P* 
ft  2  *  2 

6*  -  R* 


mean  s.d. 

-.006  .015 

-.010  .024 


The  above  experiment  was  duplicated  for  smaller  sample  size  N=100. 
In  this  case  we  obtain 


mean 


s.d. 


p*  -  p* 

-  2  *  2 
R*c  -  R* 


-.005 

-.007 


.027 

.044 


Our  next  example  consists  of  a  sample  of  200  triples  { ^yk*xkT  *xk2^ ’ 
1  <k£200)  drawn  from  the  model  Y  =  with  X^  and  X,,  generated 

independently  from  a  uniform  distribution  U(-l,l).  Note  that  0(Y)  = 
log(Y)  and  <h.(X.)  =  log  X.  (j  =1,2)  cannot  be  solutions  here  since  Y, 

J  J  J 

X-|  and  X2  all  assume  negative  values.  Figure  2a  shows  a  plot  of 
0*(yk)  versus  y^,  while  Figures  2b  and  2c  show  corresponding  plots  of 
$f(xki)  and  $f^xk2^  Oi^lZOO).  All  three  solution  transformation 
functions  are  seen  to  be  double  valued.  The  optimal  transformations  for 
this  problem  are  0*(Y)  =  log|Y|  and  $T(Xj)  =  log  |Xj|  ( j  =1,2).  The 
estimates  clearly  reflect  this  structure  except  near  the  origin  where  the 
smoother  cannot  reproduce  the  infinite  discontinuity  in  the  derivative. 
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For  our  final  example,  we  apply  the  ACE  algorithm  to  the  Boston 
housing  market  data  of  Harrison  and  Rubinfeld  [1978].  A  complete  listing 
of  these  data  appear  in  Belsey,  Kuh  and  Welsch  [1980].  Harrison  and 
Rubinfeld  used  these  data  to  estimate  marginal  air  pollution  damages  as 
revealed  in  the  housing  market.  Central  to  their  analysis  was  a  housing 
value  equation  which  relates  the  median  value  of  owner-occupied  homes  in 
each  of  the  506  census  tracts  in  the  Boston  Standard  Metropolitan 
Statistical  Area,  to  air  pollution  (as  reflected  in  concentration  of 
nitrogen  oxides)  and  to  12  other  variables  that  are  thought  to  effect 
housing  prices.  This  equation  was  estimated  by  trying  to  determine  the 
best  fitting  functional  form  of  housing  price  on  these  13  variables.  By 
experimenting  with  a  number  of  possible  transformations  of  the  14  varia¬ 
bles  (response  and  13  predictors),  Harrison  and  Rubinfeld  settled  on  an 
equation  of  the  form 

log  (MV )  -  <*1  +  a2(RM)2  +  «3  AGE 

+  log(DIS)  +  log(RAD)  +  ag  TAX 
+  a?  PTRATIO  +  ag(B-0.63)2 
+  oig  log(LSTAT)  +  a-jQ  CRIM  +  ZN 
+  a12  INDUS  +  a13  CHAS  +  a14(N0X)P  +  e  . 

A  brief  description  of  each  variable  is  given  in  Table  1.  (For  a  more 
complete  description,  see  Harrison  and  Rubinfeld  [1978],  Table  IV.)  The 
coefficients  alt...,a^4  were  determined  by  a  least  squares  fit  to  mea¬ 
surements  of  the  14  variables  for  the  506  census  tracts.  The  best  value 
for  the  exponent  p  was  found  to  be  2.0,  by  a  numerical  optimization 
(grid  search).  This  "basic  equation"  was  used  to  generate  estimates  for 
the  willingness  to  pay  for  and  the  marginal  benefits  of  clean  air. 


TABLE  1 


Variable 

MV 

RM 

AGE 

DIS 

RAD 

TAX 

PTRATIO 

B 

LSAT 

CRIM 

ZN 

INDUS 

CHAS 

NOX 


Variables  Used  in  the  Housing  Value  Equation 
of  Harrison  and  Rubinfeld  (1978) 

Definition 

Median  value  of  owner-occupied  homes 

Average  number  of  rooms  in  owner  units 

Proportion  of  owner  units  built  prior  to  1940 

Weighted  distances  to  five  employment  centers  in  the  Boston 
region 

Index  of  accessibility  to  radial  highways 
Full  property  tax  rate  ( $/$l 0 , 000 ) 

Pupil-teacher  ratio  by  town  school  district 
Black  proportion  of  population 
Proportion  of  population  that  is  lower  status 
Crime  rate  by  town 

Proportion  of  town's  residential  land  zoned  for  lots  greater 
than  25,000  square  feet 

Proportion  of  nonretail  business  acres  per  town 

Charles  River  dummy:  =  1  if  tract  bounds  the  Charles  River; 

=  0  if  otherwise 

Nitrogen  oxide  concentration  in  pphm 


t 
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Harrison  and  Rubinfeld  note  that  the  results  are  highly  sensitive  to  the 
particular  specification  of  the  form  of  the  housing  price  equation. 

We  applied  the  ACE  algorithm  to  the  transformed  measurements  (using 
p=2  for  NOX)  appearing  in  the  basic  equation.  To  the  extent  that  these 
transformations  are  close  to  the  optimal  ones,  the  algorithm  will  produce 
results  close  to  linear  functions  6(Z)  =4>-|(Z)  =  •••  =4>^(Z)  =  cxZ+B- 
Departures  from  linearity  indicate  transformations  that  can  improve  the 
quality  of  the  fit. 

Figure  3a  shows  a  plot  of  the  solution  response  transformation 
§*(log  y).  This  function  is  seen  to  have  a  positive  curvature  for 
central  values  of  y,  connecting  two  straight  line  segments  of  different 
slope  on  either  side.  This  suggests  that  the  logarithmic  transformation 
may  be  too  severe.  Figure  3b  shows  the  transformation  e*(y)  resulting 
when  the  ACE  algorithm  is  applied  to  the  original  untrans  former?  census 
measurements.  This  indicates  that,  if  anything,  a  very  mild  transforma¬ 
tion,  involving  positive  curvature,  is  most  appropriate  for  the  response 
variable. 

Figures  3c-3o  show  the  ACE  transformations  $*,..., for  the 
(transformed)  predictor  variables.  The  standard  deviation  a($t)  is 

J 

indicated  in  each  graph.  This  provides  a  measure  of  how  strongly  each 
<£t(x.)  enters  into  the  model  for  e*(y).  (Note  that  o(0*)  =  1.)  The 

J  J 

two  terms  that  enter  most  strongly  involve  the  number  of  rooms  (Figure  3c) 
and  the  fraction  of  population  that  is  of  lower  status  (Figure  3j).  The 
nearly  linear  shape  of  the  latter  transformation  suggests  that  the  original 
logarithmic  transformation  was  appropriate  for  this  variable.  The  trans¬ 
formation  on  the  number  of  rooms  variable  is  far  from  linear,  however, 
indicating  that  a  quadratic  does  not  adequately  capture  its  relationship 
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to  housing  value.  For  less  than  six  rooms,  housing  value  is  roughly 

independent  of  room  number,  while  for  larger  values  there  is  a  strong 

increasing  linear  dependence.  Among  the  next  three  variables  (in  order 

of  their  contribution  to  the  model),  log  DIS  (Figure  3e),  CRIM 

(Figure  3k),  and  INDUS  (Figure  3m),  only  CRIM  has  a  solution  close  to  a 

straight  line.  The  plots  for  the  remaining  variables  indicate  that 

several  of  them  could,  as  well,  benefit  from  transformations  substantially 

different  from  those  used  to  define  the  basic  equation. 

2 

The  marginal  effect  of  (NOX)  on  median  home  value,  as  captured  by 

A  2 

this  model,  can  be  investigated  by  studying  $*[(N0X)  ]  in  Figure  3o. 

2 

This  curve  is  a  nonmonotonic  function  of  NOX  not  well  approximated  by 
a  linear  (or  monotone)  function.  This  makes  it  difficult  to  formulate  a 
simple  interpretation  of  the  willingness  to  pay  for  clean  air  from  these 
data.  For  low  concentration  values,  housing  prices  seem  to  increase  with 
increasing  (NOX)  ,  whereas  for  higher  values  this  trend  is  substantially 
reversed. 

13~ 

Figure  3p  shows  a  scatterplot  of  0* (y.  )  versus  J  $i(x.  .).  This 

K  j=l  J  KJ 

plot  shows  no  evidence  of  additional  structure  not  captured  in  the  model 

13 

§*(y)  =  I  $T(x.)  +  e  . 
j=l  J  J 

~  2 

The  e*  resulting  from  the  use  of  the  ACE  transformations  was  0.11  as 
2 

compared  to  the  e  value  of  0.20  produced  by  the  Harrison  and  Rubinfeld 
[1978]  transformations. 


i 
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Figure  3j 
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4.  Discussion 

The  ACE  algorithm  provides  a  fully  automated  method  for  estimating 
optimal  transformations  in  multiple  regression.  It  also  provides  a 
method  for  estimating  maximal  correlation  between  random  variables.  It 
differs  from  other  empirical  methods  for  finding  transformations  (Box  and 
Tidwell  [1962];  Anscombe  and  Tukey  [1963];  Box  and  Cox  [1964];  Kruskal 
[1965];  Draper  and  Cox  [1969];  Fraser  [1967];  Linsey  [1972];  Box  and  Hill 
[1974];  Linsey  [1974];  Wood  [1974];  Mosteller  and  Tukey  [1977];  and  Tukey 
[1982])  in  that  the  "best"  transformations  of  the  response  and  predictor 
variables  are  unambiguously  defined  and  estimated  without  use  of  ad  hoc 
heuristics,  restrictive  distributional  assumptions,  or  restriction  of  the 
transformation  to  a  particular  parametric  family. 

The  algorithm  is  reasonably  computer  efficient.  On  the  Boston  housing 
data  set  comprising  506  data  points  with  14  variables  each,  the  run  took 
12  seconds  of  CPU  time  on  an  IBM  3081.  Our  guess  is  that  this  translates 
into  2.5  minutes  on  a  VAX  11/750  with  FP.  To  extrapolate  to  other  problems, 
use  the  estimate  that  running  time  is  proportional  to  (number  of  variables) 
x  (sample  size). 

A  strong  advantage  of  the  ACE  procedure  is  the  ability  to  incorporate 
variables  of  quite  different  type  in  terms  of  the  set  of  values  they  can 
assume.  The  transformation  functions  e (y )  ,<j>^  (x1 ) , . . .  ,$p(xp)  assume 
values  on  the  real  line.  Their  arguments  can,  however,  assume  values  on 
any  set.  For  example,  ordered  real,  periodic  (circularly  valued)  real, 
ordered  and  unordered  categorical  variables  can  be  incorporated  in  the 
same  regression  equation.  For  periodic  variables,  the  smoother  window 
need  only  wrap  around  the  boundaries.  For  categorical  variables,  the 
procedure  can  be  regarded  as  estimating  optimal  scores  for  each  of  their 
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values.  (The  special  case  of  a  categorical  response  and  a  single  cate¬ 
gorical  predictor  variable  is  known  as  canonical  analysis— see  Kendall 
and  Stuart  [1967],  p.568— and  the  optimal  scores  can,  in  this  case,  also 
be  obtained  by  solution  of  a  matrix  eigenvector  problem.) 

In  some  problems  the  analyst  may  wish  to  restrict  §(y)  to  be 
monotone.  For  example,  §(y)  monotone  allows  the  unique  specification  of 
y  given  a  value  of  §(y).  There  is  an  option  in  the  program  (Appendix  2) 
that  allows  this  using  Kruskal's  method  of  finding  closest  monotone  fits 
(see  Kruskal  [1964],  pp.  126-128).  However,  we  advise  running  the  regular 
algorithm  first,  since  lack  of  monotonicity  in  §*(y)  can  provide 
valuable  insight  into  the  structure  of  the  data. 

The  solution  functions  §*(y)  and  $*(x^ ) , . . .  ,S*(xp)  can  be  stored 
as  a  set  of  values  associated  with  each  observation  (yk »xk-j » •  •  •  »xkp) » 
l£k<N.  However,  since  9(y)  and  4>(x)  are  usually  smooth  (for 
continuous  y,  x),  they  can  be  easily  approximated  and  stored  as  cubic 
spline  functions  (deBoor  [1978]) with  a  few  knots. 

As  a  tool  for  data  analysis,  the  ACE  procedure  provides  graphical 

output  to  indicate  a  need  for  transformations,  as  well  as  to  guide  in  their 

choice.  If  a  particular  plot  suggests  a  familiar  functional  form  for  a 

transformation,  it  can  be  substituted  for  the  empirical  transformation 

estimate  and  the  ACE  algorithm  can  be  rerun  using  an  option  which  alters 

only  the  scale  and  origin  of  that  particular  transformation.  The  resulting 
? 

e  can  be  compared  to  the  original  value.  We  have  found  that  the  plots 
themselves  often  give  surprising  new  insights  into  the  relationship  between 
the  response  and  predictor  variables. 

As  with  any  regression  procedure,  a  high  degree  of  association  between 
predictor  variables  can  sometimes  cause  the  individual  transformation 
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estimates  to  be  highly  variable  even  though  the  complete  model  is  reasonably 
stable.  When  this  is  suspected,  running  the  algorithm  on  randomly  selected 
subsets  of  the  data,  or  on  bootstrap  samples  (Efron  [1979])  can  assist  in 
assessing  the  variability. 

The  ACE  method  has  generality  beyond  that  exploited  here.  An  imme¬ 
diate  generalization  would  involve  multiple  response  variables  Y-j . Y  . 

The  generalized  algorithm  would  estimate  optimal  transformations 

*  ★  *  ★ 

©I ... .,e  ,$i ,. .. ,$  that  minimize 

E'Et.iVV  -  ^W'2 

subject  to  Ee^  =  0,  £=1,...,q,  E<pj=0,  j*l ....  ,p 
and  Hj=1ei(Y£)l2  =  1  . 

This  extension  generalizes  the  ACE  procedure  in  a  sense  similar  to 
that  in  which  canonical  correlation  generalized  linear  regression. 

The  ACE  algorithm  (Section  2)  is  easily  modified  to  incorporate 
this  extension.  An  inner  loop  over  the  response  variables,  analagous 
to  that  for  the  predictor  variables,  replaces  the  single  function 
minimization. 
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5.0  Optimal  Transformations  in  Function  Space 
Introduction 

Define  random  variables  to  take  values  either  in  the  reals  or  in  a 
finite  or  countable  unordered  set.  Given  a  set  of  random  variables 
Y,Xj»...,Xp,  a  transformation  is  defined  by  a  set  of  real  valued 
measurable  functions  (0,<|>j, . . .  ,$  )  =  (0,£),  each  function  defined  on  the 
range  of  the  corresponding  random  variables,  such  that 

(5.1)  E9(Y)  =  0  ,  E*j(Xj)  =  0  ,  j  =1 . p 

E02(Y)  <  -  ,  E 4>j ( X j )  <»  ,  j=l,...,p 

Use  the  notation 

(5.2)  m  =  f  ♦j(Xj) 

Denote  the  set  of  all  transformations  by  F. 

(5.3)  DEFINITION.  A  transformation  (0*»£*)  is  optimal  for  regression 
if  E(0*)2  =  1,  and 

e*2  =  E[0*(Y)-$*(X)]2  =  inf  {E[0(Y)-*(X)]2 ;E02=1 } 

(5.4)  DEFINITION.  A  transformation  (0**,£**)  is  optimal  for 

correlation  if  E(0**)2  =  1,  E($**)^  =  1, 

p*  =  E[9**(Y)$**(X)]  -  SUP  (E[e(Y)$(X)]2;  E(*)2  = 1,E02=1 } 

(5.5)  THEOREM.  If  ( 0 **,£**)  is  optimal  for  correlation,  then  0*  =  9**, 

is  optimal  for  regression  and  conversely .  Furthermore 


J 
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PROOF.  Write 

E(e-$)2  =  1  -  Ee|  +  U2 

=  l  -  2E(6$1)^r  +  E02 

where  $  *  Hence 

(5.6)  E(9-$)2  >  1  -  2pVE^  +  E$2 

with  equality  only  if  E9$  =  p*.  The  minimum  of  the  right  side  of  (5.6) 

o  o  o  2 

over  E<()  is  at  E<j>  =  (p*)  where  it  is  equal  to  l-(p*)  .  Then 

0  o 

(e  r  *  1  -  (p*)  and  if  (9**,^**)  is  optimal  for  correlation,  then 
q*  s  e**,  q*  =  p*<j>**  is  optimal  for  regression.  The  argument  is 
reversible. 


5.1  Existence  of  Optimal  Transformations 

To  show  existence  of  optimal  transformations,  two  additional 
assumptions  are  needed: 

AI.  There  is  no  non-zero  eet  of  functions  satisfying  (5.1)  such  that 

8(Y)  +  9 j ( X j )  =0  a.s. 

To  formulate  the  second  assumption,  define 

(5.7)  DEFINITION.  Define  the  Hilbert  spaces  H2 ( Y )  ^(X^) , . . .  ^(X  ) 
as  the  sets  of  functions  satisfying  (5.1)  with  the  usual  inner  product , 
i.e.,  H2 ( X j )  is  the  set  of  all  measurable  <f>j  such  that  E4>j(Xj)  =  0, 
E<*j(Xj)  <  <*>  with  (♦J,^)  *  E[4>j(Xj)d>J(Xj)]. 
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AII.  The  conditional  expectation  operators 

E(4>j(Xj)  IY) :  H2(Xd)  -v  H2(Y)  , 

E(<i>j(Xj)|Xi):  H2(Xd)  —  H2(X.)  ,  i 
E(6(Y)jXj):  H2(Y)  —  H2(Xj) 

are  all  compact. 

Condition  All  is  satisfied  in  most  cases  of  interest.  A  sufficient 
condition  is  given  by:  let  X,  Y  be  random  variables  with  joint  density 
fx  y  and  marginals  fx>  fy.  Then  the  conditional  expectation  operator  on 
H2(Y)  — ►  H2( X)  is  compact  if 

(5.8)  |  [f^y/f^fyjdxdy  <  °° 

(5.9)  THEOREM.  Under  AI  and  All  optimal  transformations  exist. 

Some  machinery  is  needed. 

(5.10)  PROPOSITION.  The  set  of  all  functions  f  of  the  form 

f(Y,x)  =  9(Y)  +  I-j0j(Xj)  ,  e  g  H2(Y),  g  h2(xj.) 

with  the  inner  product  and  norm 

(g.f)  =  E[gf]  ,  (Ifl2  =  Ef2  , 

is  a  Hilbert  space  denoted  by  H2>  The  subspace  of  all  functions  <j>  of 

the  form 

♦(X)  *  Zi^(Xj)  ,  <t>jGH2(V 

te  a  closed  linear  subspace  denoted  by  H2(X).  So  are  H2(Y)»H2(Xj) ,. . . ,H2(Xp) . 
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(5.10)  follows  from: 

(5.11)  PROPOSITION.  Under  AT,  All  there  are  constants  0  <  £  Cg  <  00 

such  that 

C;l(iei2  +i;i4>ji2)  £  19  +Ii0ji2  £  c2(iei2  . 

PROOF.  The  right  hand  inequality  is  immediate.  If  the  left  side  does 
not  hold,  we  can  find  a  sequence  fp  =  0n  +  J<j>n  j  such  that 

9  9  9 

ienl  +EllV j1  =  **  l)ut  There  is  a  subsequence  n'  such 

that  9n,  e,  4>n , j  4>j  in  the  sense  of  weak  convergence  in 
^(Yj.HgfXj), .. . »H2(Xp)  respectively. 

Wri  te 


E£Vj<Xj>Vl<x1»  -  EUn.J(Xj)E(*n.1{Xi)|*j)3 


to  see  that  All  implies  E<p  , i4>  ,  -  — ►Eq>i4>. ,  i  f  j  and  similarly  for 

E0.<{>n,  Furthermore  !<{>.fl  <  lim  II<J>  .J,  B9B  <  lim  I0.I.  Thus, 

n  n  »j  1  * -  n  i  —  — —  n 

defining  f  *  0  +  Zj'Pj 

Ifl2  =  16  <  ljm  Ifn,|2  =  0 

which  implies,  by  AI,  that  6  *  s  •••  =4*^  =  0.  On  the  other  hand. 


if  l2  =  le  ,2 
n 


Hence,  if  f  =  0,  then  l_im  lfpr  >  1. 


(5.12)  COROLLARY.  If  f 


f  in  H9,  then  0„ 


W 


♦nj  — *  <t»j  in  H2(Xj),  j  *l,...,p,  and  conversely. 

PROOF.  If  fn  =  9n 0 then  by  (5.11), 
TTm  I4»njl  <  ®.  Take  n'  such  that  0n.  0',  4>n , ^ 


in  H2(Y), 


TTm  ienl  <  », 
►  $J,  and  let 

J 
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f‘  =  e'  +Ij4,j.  Then  for  any  9  G  H2’  — *-(g*»ir,)»  so  (g,f)  = 

(g,f')  all  g.  The  converse  is  easier. 

(5.13)  DEFINITION.  In  H 2 >  Pyj  Pj  ar^d  P^  denote  the  projection 

operators  on  H2(Y),  H2(Xj)  an^  ^(X)  respectively . 

On  H2(X^),  Pj,  j  t  i,  is  the  conditional  expectation  operator,  and 
similarly  for  other  subspaces. 

(5.14)  PROPOSITION.  Py  is  compact  on  H2(X) — ►H^Y)  and  P^  is  compact 
on  H2(Y)  ->H2(X). 

PROOF.  Take  $n  6  H2(X),  This  implies,  by  (5.12),  that 

<t>n  j  <t>j.  By  All,  Py<j)n  j  -$-+  Py4> j  so  that  Py$n  Pyt>.  Now  take 

0  e  H2(Y),  4>€H2(X),  then  (0,Py<+>)  =  (0,<j>)  =  (Px6,$).  Thus, 

Py  H2(Y)— +H2(X)  is  the  adjoint  of  Py  and  hence  compact. 

Now  to  complete  the  proof  of  Theorem  5.9.  Consider  the  functional 

^  2  ~  9  ^ 

ie-<t>n  on  the  set  of  all  (0,<j>)  with  lei  =  1.  For  any  0,  <j> 

n 0-$n 2  >  ii0-px0n2 

2  2 

If  there  is  a  0*  which  achieves  the  minimum  of  R 0-Px© B  over  II e D  =  1, 

p 

then  an  optimal  transformation  is  0*,  Px©*.  On  II 0 II  -  1 

10-px0H2  =  1  - 11  p xe h 2  . 

Let  I  =  {sup  IIPx0i  ;  I0i  =1).  Take  '■  such  that  H 9nR 2  =  1,  0p  0, 

and  lPx0nl— *-s.  By  the  compactness  of  °x,  I Px0nB  — *- 1 Px©  1  =  5. 

Further,  101  <  1.  If  lei  <  1,  then  for  0'=  0/101,  we  get  the 
contradiction  I P^e ' I  >  5.  Hence  lei  =  1  and  (0.  Pxe)  is  an  optimal 
transformation. 
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5.2  Characterization  of  Optimal  Transformations 

Define  two  operators  U:  ^(Y)  — ^(Y)  and  V:  H^CX)  — by 

U0  =  PyPx6  ,  V$  =  PxPy$ 

(5.15)  PROPOSITION.  U  and  V  are  compact ,  self-adjoint  and  non-negative 
definite.  They  have  the  same  eigenvalues  and  there  is  a  1-1  correspondence 
between  eigenspaaes  for  a  given  eigenvalue  specified  by 

$  *  PX0/llPX6fi  ,  0  =  Pyt/IPyfl 

PROOF.  Direct  verification. 

Let  the  largest  eigenvalue  be  denoted  by  X,  X  =  Hull  =  II V II .  Then 

(5.16)  THEOREM,  if  0*,  $*  is  an  optimal  transformation  for  regression , 
then 

X0*  «  U0*  ,  X$*  =  V$* 

Conversely y  if  0  satisfies  X0  =  U0,  II 01  -  1,  then  0,  Px0  is  optimal 

for  regression.  If  $  satisfies  X$  =  V$,  then  0  »  Py$/IIPy$ll,  and 
X$/IIPy<j>ll  are  optimal  for  regression.  In  addition 

(e*)2  =  1-x  . 

PROOF.  Let  0*,  $*  be  optimal.  Then  $*  =  Px0*.  Write 

ae*-$*n2  =  i  -2(e*,$*)  +  ii$*ii2 

Note  that  (0*,$*)  =  (e*,Py$*)  <  8Py$*ll  with  equality  only  if 

6*  =  ePy$*,  e  constant.  Therefore,  0*  =  Py$*/l|Py$*ll .  This  implies 

I Py$*l 0*  =  U6*  ,  IPy$*S$*  =  V$*  , 
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so  that  IPy<j>*l  is  an  eigenvalue  X*  of  U,  V.  Computing  gives 

=  1  -  X*.  Now  take  0  any  eigenfunction  of  U  corresponding  to 
X,  with  lei  =  1.  Let  $  =  P^9,  then  8e-$l  =  1-X.  This  shows  that 
0*,  are  not  optimal  unless  X*  =  X.  The  rest  of  the  theorem  is 
straightforward  verification. 

(5.17)  COROLLARY.  If  X  has  multiplicity  one,  then  the  optimal  trans¬ 
formation  is  unique  up  to  a  sign  change.  In  any  case,  the  set  of  optimal 
transformations  is  finite  dimensional. 

It  appears  that  uniqueness  is  the  general  case- 


5.3  Alternating  Conditional  Methods 

Direct  solution  of  the  equations  X0  =  U0  or  X$  =  V<j>  is 

formidable.  Attempting  to  use  data  to  directly  estimate  the  solutions 

is  just  as  difficult.  In  the  bivariate  case,  if  X,  Y  are  categorical, 

then  X0  =  U0  becomes  a  matrix  eigenvalue  problem  and  is  tractable.  This 

is  the  case  treated  in  Kendall  and  Stuart  [1967]. 

The  ACE  algorithm  is  founded  on  the  observation  that  there  is  an 

iterative  method  for  finding  optimal  transformations.  We  illustrate  this 

2 

in  the  bivariate  case.  The  goal  is  to  minimize  B 6 ( Y ) -4>(X ) D  with 
lei2  *  1.  Denote  Px0  =  E(ejX),  Py$  =  E($>|Y).  Start  with  any  first 

guess  function  9q(Y).  Then  define  a  sequence  of  functions  by 

=  PX60 

=  WW 
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and  in  general  *n+l  =  PX6n’  9n+l  =  PY4>n+l/,PY4>n+ll  *  It  is  clear  that 

2 

at  each  step  in  the  iteration  |0-<j>|  is  decreased.  It  is  not  hard  to 
show  that  in  general,  8n,  $n  converge  to  an  optimal  transformation. 

The  above  method  of  alternating  conditionals  extends  to  the  general 
multivariate  case.  The  analogue  is  clear;  given  0n,  $  ,  then  the  next 
iteration  is 

Vl  =  Pxen  ’  0n+l  =  Vn+l^Vn+11 

However,  there  is  an  additional  issue:  How  can  Px0  be  computed  using 
only  the  conditional  expectation  operators  P.,  j  =1 . p?  This  is  done 

J 

by  starting  with  some  function  4>q  and  iteratively  subtracting  off  the 
projections  of  0-$n  on  the  subspaces  H2(X1) ,. . . ,H2(Xp)  unt1l  we  get  a 
function  <j>  such  that  the  projection  of  8-$  on  each  of  H?(X.)  is  zero. 
This  leads  to 

The  Double  Loop  Algorithm 
The  Outer  Loop 

1.  Start  with  an  initial  guess  0q{Y). 

2-  Put  *n+l  =  PX6n*  en+l  =  PyVi/IPyW  and  rePeat  until 
convergence. 

Let  P£0q  be  the  projection  of  0Q  on  the  eigenspace  E  of  U 
corresponding  to  X.  Then 

(5.18)  THEOREM.  If  IP^OqI  t  0,  define  an  optimal  transformation  by 

0* '  pe0o/ipe8o1,  **  =  Px60*  'V6*1  “V**1-*0- 

PROOF.  Notice  that  ©n+1  *  U0n/IU0nl.  For  any  n,  0n  *  an0*+9n»  where 
gnlE.  Because,  if  it  is  true  for  n,  then 
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9n+l  =  <Ve*+u«n>/,Ve*+l*nl 

and  Ugn  Is  1  to  E.  For  any  giE,  lUgl  <  rlgl  where  r  <  A.  Since 

“n+1  '  *VIUV-  Vl  *  V*.1'  the" 

,W,Vl  =  “VK  i  (r/X)lg„«/a„  • 

Thus  I g  I / atn  £  c(r/X)n.  But  16  II  =  1,  ap+lg  1^  =  1,  implying 
2 

a  — *-1.  Since  an  >  0,  then  a  >  0,  so  a  — *-1.  Now  use  10-0*1  = 
n  u  n  n  n 

(1-a)  +lg„IT  to  reach  the  conclusion.  Since  10...  1-0*1  *  iPv0..-Pv6*l 
n  Jn  Tn+i  x  n  x 

<  l0n-0*l,  the  theorem  follows. 

The  Inner  Loop 

1.  Start  with  functions  0,  0g. 

2.  If,  after  m  stages  of  iteration,  the  functions  are  then 

J 

define,  for  j  *  1,2 . p, 

*(im+1)  -  P ,(0  -  l  4m)-  I  4>jm+1)) 

J  0  i>j  J  i<j  1 

(5.19)  THEOREM.  Let  0m  =  ,PX0AB  _>0* 

PROOF.  Define  the  operator  T  by 

T  =  (t-Pp)(I-Pp.1)-U-p1) 

Then  the  iteration  in  the  inner  loop  is  expressed  as 

(5.20)  9-Vj  ■  T(8-*m) 

-  l*(e-}0) 

Write  0-0Q  =  0  -  P^0 +P^0 -0g.  Noting  that  T(0-Px@)  =  0-P^e,  (5.20) 
becomes 

Vi  ■  px8  +Tm(Pxe-$0) 


J 
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The  theorem  is  then  proven  by 

(5.21)  PROPOSITION.  For  any  $  €  H^X),  ll"$l  —  0. 

PROOF.  |(I-P.)$I2  =  HI2  -  IP -$B2  <  J$n2.  Thus  IT*  <  1.  There  is  no 
J  J 

<j>  f  0  such  that  IT$I  =  l$l.  If  there  were,  then  |P.$I  =  0,  all  j. 

J 

Then  for  <j>'  * 

J 

(*.$')  =  Ij(Mj)  =  =  0 

J 

The  operator  T  =  I  +  W,  where  W  is  compact*  Now  we  claim  that 
|T,nWI  — *-0  on  H2(X).  To  prove  this,  let  0  <  y  <  1  and  define 

G(y)  =  sup  {I1U$I/IH$I;  IW$1  >y)  . 

$ 

Take  £  1,  IWij^ll  _>  y  so  that  IITW^H/IIW^SI  — *-G(y) .  Then 

l$l  <  1,  IW$I>y  and  G(y)  »  ITW$I/IW$I.  Thus  G(y)  <  1,  for  all 
0  <  y  <  1  and  is  clearly  non- increasing  in  y*  Then 

=  ITWT,n'1$l  <  G( IT,n_1W$l ) ITm"1W$»  . 

Put  yq  =  8WS ,  Ym  =  G(Ym.1)Yn.1»  then  if’Vli  <  ym.  But  clearly  Ym~+0- 
The  range  of  W  is  dense  in  H2(X).  Otherwise,  there  is  a  <j>'  f  0 
such  that  (<j>',W4>)  =  0,  all  $.  This  implies  (W*|j)',$)  =  0  or  W*<j>'  =  0. 
Then  IT*<j>'l  =  lijS'l  and  a  repetition  of  the  argument  given  above  leads 
to  <j>'  =  0.  For  any  $  and  e  >  0,  take  so  that  I<j>-W<j>jl  £  e. 

Then  ITm<j>l  ^e+lT^W^S,  which  completes  the  proof. 

There  are  two  versions  of  the  double  loop.  In  the  first,  the  initial 
functions  ^  are  the  limiting  functions  produced  by  the  preceding 
inner  loop.  This  is  called  the  restart  version.  In  the  second,  the 


Initial  functions  are  <j>0  h  0.  This  is  the  fresh  start  version.  The 
main  theoretical  difference  is  that  a  stronger  consistency  result  holds 
for  fresh  start.  Restart  is  a  faster  running  algorithm,  and  is  embodied 
in  the  ACE  code. 

The  Single  Loop  Algorithm 

The  single  loop  algorithm  combines  a  single  iteration  of  the  inner 
loop  with  an  iteration  of  the  outer  loop.  Thus,  it  is  summarized  by 

1.  Start  with  6q»  $g  *  °. 

2.  If  the  current  functions  are  @n,  ,  define  4>n+1  by 

VVl  '  T<V*r,> 

3.  Let  6n+1  =  Py^n+^"PY^n+l* '  Run  to  conver9ence- 

This  is  a  cleaner  algorithm  than  the  double  loop  and  its  implementa¬ 
tion  on  data  runs  at  least  twice  as  fast  as  the  double  loop  and  requires 
only  a  single  convergence  test.  Unfortunately,  we  have  been  unable  to 
prove  that  it  converges  in  function  space.  Assuming  convergence,  it  can 
be  shown  that  the  limiting  0  is  an  eigenfunction  of  U.  But  giving  condi 
tions  for  0  to  correspond  to  X  or  even  showing  that  0  will  correspond 
to  X  "almost  always"  seems  difficult. 
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6.0  The  ACE  Algorithm  on  Finite  Data  Sets 
Introduction 

The  ACE  algorithm  is  implemented  on  finite  data  sets  by  replacing 
conditional  expectations,  given  continuous  variables,  by  data  smooths. 

In  looking  at  the  convergence  and  consistency  properties  of  the  ACE 
algorithm,  the  critical  element  was  the  properties  of  the  data  smooth 
used.  The  results  are  fragmentary.  Convergence  of  the  algorithm  is  proven 
only  for  a  very  restricted  class  of  smooths.  In  practice,  in  over  1000 
runs  of  ACE  over  a  wide  variety  of  data  sets  and  using  three  different 
types  of  smooths,  we  have  seen  only  one  instance  of  failure  to  converge. 

A  fairly  general,  but  weak,  consistency  proof  is  given.  We  conjecture  the 
form  of  a  stronger  consistency  result. 


6.1  Data  Smooths 

Define  a  data  set  D  to  be  a  set  {x^-.-.x^}  of  N  points  in  p 
dimensional  space,  i.e.  xk  =  (xkl,...,xkp).  Let  VN  be  the  collection 
of  all  such  data  sets.  For  fixed  D,  define  F(x)  as  the  space  of  all 
real-valued  functions  <t>  defined  on  D,  i.e.  $  e  F(x)  is  defined  by 
the  N  real  numbers  (<1>(x1),...,4>(xn)}.  Define  F(xj),  j=l,...,p  as 
the  space  of  all  real-valued  functions  defined  on  the  set  ^xij»x2j»*  *  *  »xnj ^  * 


(6.1)  DEFINITION.  A  data  smooth  S 
S:  F(x)  — >F(xj)  defined  for  every 
corresponding  element  in  F{ Xj  )  by 


of  X  on  X.  is  a  mapping 
J 

D  in  V If  <j>  €  F ( x )  denote  the 
S(<J>|Xj)  and  its  values  by  S ( 4> |  xkj)* 


Let  x  be  any  one  of  Xj,...,xp.  Some  examples  of  data  smooths  are 
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1.  Histogram:  Divide  the  real  axis  up  into  disjoint  intervals  {I^}.  If 
xk  e  l£,  define 


S(4>  |  xk>  = 


£ 


2.  Moving  Average:  Fix  M  <  N/2.  Order  the  xk  getting  x1<X2<-**<xN 
(assume  no  ties),  and  corresponding  <j>(x1),...,<j)(xn).  Put 

s(4>lxk) -  m  mL-M 

mfO 

If  M  points  are  not  available  on  one  side,  make  up  the  deficiency  on  the 
other  side. 


3.  Kernel :  Take  K(x)  defined  on  the  reals  with  maximum  at  x  =  0.  Then 

sw*k)  *  i  i'(im)«vxk)/5  K(vV 

m  a 

4.  Regression:  Fix  M  and  order  xk  as  in  (2)  above.  At  xk>  regress 
the  values  of  <J>(xk_M) , . . .  ><t>(xk+M)  excluding  d>( xfc)  on  xk_M, . . .  ,xk+M 
excluding  xk,  getting  a  regression  line  L(x).  Put  S(<l>|xk)  =  L(xk). 

If  M  points  aia  not  available  on  each  side  of  xk  make  up  the  deficiency 
on  the  other  side. 

5.  Supersmoother:  See  Friedman  and  Stuetzle  [1982]. 

Some  properties  that  are  relevant  to  the  behavior  of  smoothers  are  given 
below.  These  properties  hold  only  if  they  are  true  for  all  D  e  T>n. 

Linearity.  A  smooth  is  linear  if 

S(a<$j  +  3d>2 )  =  +  6S4>2 


for  all  <t>j,  <t>2  e  F(x)  and  all  constants  a,  8. 


Constant  Preserving.  If  <j>  G  F(x)  is  constant,  <j>  =  c,  then  S4>  =  c. 

To  give  a  further  property,  introduce  the  inner  product  (  )N  on 
F(x)  defined  by 

U.<t>')N  =  l  Ik  <Kxk)4>’(xk) 

and  the  corresponding  norm  I  8^. 

Boundedness.  S  is  bounded  by  M  if 

IS*IN  <  MI*In  ,  all  4>  G  F(x) 

In  the  examples  of  smooths  given  above,  all  are  linear,  except 
supersmoother.  This  implies  they  can  be  represented  as  an  N*N  matrix 
operator  varying  with  D.  All  are  constant  preserving.  Histograms  and 
moving  average  are  bounded  by  one.  Regression  is  unbounded  due  to  end 
effects,  but  in  the  appendix  we  introduce  a  modified  regression  smooth 
that  is  bounded  by  2.  Supersmoother  is  bounded  by  2.  The  bound  for 
kernel  smooths  is  more  complicated. 


6.2  Convergence  of  ACE 

Let  the  data  be  of  the  form  (yk,xk)  =  (yk»xkl . xkp), 

Assume  that  y  *  x^  * •••  =  xp  *  0.  Define  smooths  Sy,Sj,...,Sp  where 
sy:  F(y»?)  — *F(y)  and  S^:  F(y,x)  -+F(Xj) .  Let  H2(y,x)  be  the  set  of 
all  functions  in  F(y,x)  with  zero  mean  and  H2(y),  H2(xj)  the  corres¬ 
ponding  subspaces. 

It  is  essential  to  modify  the  smooths  so  that  the  resulting  functions 


have  zero  means.  This  is  done  by  subtracting  the  mean;  thus  the  modified 
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S,  is  defined  by 

w 

(6.2)  Sj4>  =  Sj(,)'^ 


Henceforth,  we  use  only  modified  smooths  and  assume  the  original  smooth 
to  be  constant  preserving,  so  that  the  modified  smooths  take  constants 
into  zero. 

The  ACE  algorithm  is  defined  by 
1.  e<°>(  ,K)  =  yk,  *<°>(xkJ)  -i  0. 


The  Inner  Loop 


2.  At  the  n  stage  of  the  outer  loop,  start  with 

J 

For  every  m  >_  \  and  j  =  1,...  ,p  define 

>«)  ,  s  (6<n)  .  j  >H)  .  j  >)) 

J  J  i<j  i>j 

Keep  increasing  m  until  convergence  to 

J 


The  Outer  Loop 


3. 


Set  e^n+1)  =  go  back  to  the  inner  loop 

with  <t>^  =  4>..  Continue  until  convergence. 

J  J 


To  formalize  this  algorithm,  introduce  the  space  H2 ( 0 »4>)  with 
elements  (9 ,g>^, . . . »4>p) *  9  e  H2(y).  and  subspaces  H2(e) 

with  elements  (9,0,0, ...  ,0)  =  0  and  H2($)  with  elements  (0,d>^,. . . ,4>p) 
*  $• 

For  f  =  (fo»fi . fp)  in  H2(e,$)  define  Sy.  H2(e,$)  -*H2(9,$) 

by 

<S/>i  = 

J  J  i?j 


0  ,  j  ?M 

f,-  +s ,.(  y  f4)  ,  j  =  i 


ii 
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Starting  with  0  =  (0,0,0, ...  ,0) ,  =  (0,<J>^,...,4)^)  one  complete 

cycle  in  the  inner  loop  is  described  by 

(6.3)  §-$(m+1)  =  (i-sp)(i-sp_1)-..(i-s1)(e-$(n0)  . 

Define  T  on  ^(O.^)  — ^(Q.^)  as  the  product  operator  in  (6.3).  Then 

(6.4)  =  0  -fte-^01)  . 

If,  for  a  given  0,  the  inner  loop  converges,  then  the  limiting  $ 
satisfies 

(6.5a)  Sj(0-$)  =  0  ,  j  =l,...,p  . 

That  is,  the  smooth  of  the  residuals  on  any  predictor  variable  is  zero. 
Adding 


(6.5b)  0  =  Sy$/llSy$ttN 

to  (6.5a)  gives  a  set  of  equations  satisfied  by  the  estimated  optimal 

transformations. 

Assume,  for  the  remainder  of  this  section,  that  the  smooths  are 
linear.  Then  (6.5a)  can  be  written  as 

(6.6)  =  S.Q  ,  j  = l,...,p  . 

Let  sp(S.)  denote  the  spectrian  of  the  matrix  S..  Assume  1  %  sp(S.). 

w  J  J 

(For  constant  preserving  smooths  1  is  always  in  the  spectrum,  but  not 
for  modified  smooths.)  Define  matrices  A-  by  A.  =  S.(I-S.)_1  and 

J  J  J  J 

the  matrix  A  as  T.A..  Assume  further  that  -1  ^  sp(A).  Then  (6.6) 

J  J 

has  the  unique  solution 

(6.7)  ♦j  ■  Ajd+A)-1©  ,  j=l . p 
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The  element  $  =  - . .  »4>p)  given  by  (6.7)  will  be  denoted  by  P6. 

Rewrite  (6.3)  using  ( I-T) (6-P9)  =  0  as 

(6.8)  ^  =  pe  -^(pe-^0^) 

Therefore,  the  inner  loop  converges  if  it  can  be  shown  that  fY  — *-0  for 
all  f  £  H2(<}>).  What  we  can  show  is 

(6.9)  THEOREM.  If  det[I+A]  f  0  and  if  the  spectral  radii  of 

are  all  less  than  one ,  a  necessary  and  sufficient  condition  for  n-o 
for  all  f  6  H2(<t>)  is  that 

P  1 

(6.10)  det[XI  -n(I-S./X)  1(I-S.)] 

2  J  J 

have  no  zeroes  in  |  A  (  1  except  A  =  1. 

PROOF.  For  ^f— *0,  all  f  GH^j),  it  is  necessary  and  sufficient 
that  the  spectral  radius  of  T  be  less  than  one.  The  equation  Tf  =  Af 
in  component  form  is 

(6.11)  Af.  =  -S  .( A  I  f,  +  l  f.)  ,  j  =1 . p  . 

J  i<0  i>j 

Let  s  =  |f^  and  rewrite  (6.11)  as 

(6.12)  (Al-S.)f.  =  S .(( 1-A)  l  f.  -s)  . 

j  j  J  i  <  j 

If  A  *  1,  (6.12)  becomes  ( I-S  .)f .  =  -S.s  or  s  =  -As.  By  assumption, 

J  J  J 

this  implies  s  =  0,  and  hence  f.  =  0,  all  j.  This  rules  out  A  =  1 

J 

A 

as  an  eigenvalue  of  T.  For  A  f  1,  but  A  greater  than  the  maximum  of 

the  spectral  radii  of  the  S.,  j  =l,...,p,  define  g.  =  (1-A)  £  f,-s. 

J  J  i<j  1 

Then  fj  =  (gJ+1-gj )/( 1-x) ,  so 

(JI-Sj)(9j.l-9j)  ' 
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or 

(6.13)  9J+1  -  (I-Sj/xr'd-Sjlgj  . 

Since  gp+1  =  -Xs,  gj  =  -s,  then  (6.13)  leads  to 

(6.14)  Xs  =  (I-Sp/X)-1(I-Sp) - • -(I-S1/X)"1(l-S1)s 

If  (6.14)  has  no  non-zero  solutions,  then  s  =  0,  g.  =  0,  j  =1 . p, 

J 

implying  all  f.  =  0.  Conversely,  if  (6.14)  has  a  solution  s  0,  it 
leads  to  a  solution  of  (6.11). 

Unfortunately,  condition  (6.10)  is  difficult  to  verify  for  general 
linear  smooths.  If  the  S.  are  self-adjoint,  non-negative  definite, 

J 

such  that  all  elements  in  the  unmodified  smooth  matrix  are  non-negative, 
then  all  spectral  radii  of  S.  are  less  than  one,  and  (6.10)  can  be  shown 

J 

to  hold  by  verifying  that 

P  -1 

|x|  <  nil ( i-s ./X)  A( I-S .) H 

l  J  J 

has  no  solutions  X  with  | X j  >  1,  and  then  ruling  out  solutions  with 
|X|  -  1. 

The  only  common  type  of  smooth  satisfying  the  above  conditions  is 
the  histogram  smooth,  a  poor  smooth  to  use  in  implementing  ACE. 

A 

Assuming  that  the  inner  loop  converges  to  PO  ,  then  the  outer  loop 
iteration  is  given  by 


Put  the  matrix  SyP  =  U,  so  that 
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(6.15) 


(ntl)  _  ue<n) 

'io^N 


If  the  eigenvalue  X  of  0  having  largest  absolute  value  is  real 
and  positive,  then  e^n+1^  converges  to  the  projection  of  on  the 

eigenspace  of  X.  The  limiting  e,  P9  is  a  solution  of  (6.5a,b). 

a  ( n  ^ 

However,  if  X  is  not  real  and  positive,  then  9'  '  oscillates  and  does 
not  converge.  If  the  smooths  are  self-adjoint  and  non-negative  definite, 
then  SyP  is  the  product  of  two  self-adjoint  non-negative  definite 
matrices,  hence  has  only  real  non-negative  eigenvalues.  We  are  unable  to 
find  conditions  guaranteeing  this  for  more  general  smooths. 

Thus,  in  spite  of  the  fact  that  ACE  has  invariably  converged  (with 
one  exception)  we  cannot  give  a  general  convergence  proof.  We  conjecture 
that  convergence  holds  only  for  '‘most"  data  sets  in  a  sense  made  explicit 
in  the  following  section. 


6.3  Consistency  of  ACE 

For  any  functions  in  H2( Y) ,H2( X^) . H2(Xp),  and 

any  data  set  D€p  define  functions  P.(4>.  jx  .)  by 

J  *  J 

<6J6)  ■  E<¥*i>  W 

Let  ^  in  HgUj)  be  defined  as  the  restriction  of  4k  to  the  set  of 
data  values  (x^ , . . . ,xNj)  minus  its  mean  value  over  the  data  values. 

Assume  that  the  N  data  vectors  (y^.x^)  are  independent  samples 
from  the  distribution  of  (Y,X^,...,X  ). 
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(6.17)  DEFINITION.  Let  be  any  sequence  of  data  smooths.  They 

J  J 

are  mean  square  consistent  if 


for  all  4>q 


E|S(,N)(  l  |x  •)  -  l  p.u.|x  )|* 
j  J  i*i  J  1  J  N 


as  above j  with  the  analogous  definition  for 


S(N) 


The  m.s.  consistency  of  some  smooths  is  discussed  in  the  next  section. 

Whether  or  not  the  algorithm  converges,  a  weak  consistency  result 
can  be  given  under  general  conditions.  Start  with  6Q  E  ^(Y).  On  each 
data  set,  run  the  inner  loop  iteration  m  times,  that  is,  define 

}(ntl)  „  9(n>  -TVn>-{<">) 


Then  set 


fl(n+1)  _  p 
-m  '  Py 


,(n+n  (n+l) 

■m  y  m 


N 


Repeat  the  outer  loop  l  times  getting  the  final  functions  eN(y;m,£), 
<()jN(Xj;m,il) .  Do  the  analogous  thing  in  function  space  starting  with  0q, 
getting  functions  whose  restriction  to  the  data  set  D  are  denoted  by 
0(y;m,£),  <t>,(x.;m,£).  Then 

J  vJ 

(6.18)  THEOREM.  If  the  smooths  S ^ ^  are  m.s.  consistent  and 

y  j 

uniformly  bounded  as  N  —*■<*>,  then 


EH 0^(y »ro> £) _0(y » *m» £) H ^  >-0  ,  Ell4>j|^(Xjim,£)-0j(Xj‘,m,£)ll|^  >-0 

If  0*  is  the  optimal  transformation  P^Qq/IIP^OqII,  =  P^0*,  then  as 
m,  i t— »«>  in  any  way ,  for  the  fresh  start  algorithm 


I9( •  ;m,£)-9*l  —*0  j  * »m,Jt)-<}>^ll  — *-0  . 

J  J 
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PROOF.  First  note  that  for  any  product  of  smooths 

1  £ 

e|$(N)  ..  .3(^)9  _p  ...p  0  |2  q 

tPi1  °0  1 :  OIIN  U 

This  is  illustrated  with  sjN^sjN^e0,  i  /j.  Since  EISjN^6o‘Pj0olN  ~*0, 
then  SjN^9q  =  P j 0q  +4>j  N  where  El<t>j  NU ^  — »-0 .  Therefore 

s(N)/s(N)fl  x  „  s(N)p  +  S(N) 

Si  V  5i  Pj90  +  Si  *j,N 

By  assumption  NIN  <_  MU 4^  N#N,  where  M  does  not  depend  on  N. 

Therefore  By  assumption  EIS1N>f,oVPiPoeolN-°  s° 

that  Els1(N)S<N)0o-PjPj9o|E-*O. 

(6.19)  PROPOSITION.  If  9 ^  is  defined  in  H2(y)  for  all  data  sets  D, 
and  0  G  H2 ( Y )  such  that 

E|eN(y)-e(y)l2  —  0 

eN(y)  ej*l|2_>0 
TbiIn  • 


then 


110 


N  N 


PROOF.  Write  6/1911  =  9/B9!lN  +  6(1/B6J  -1/I0IN).  So  two  parts  are 


needed.  First,  to  show  that 


Second,  that  "TeT“^N ' >0*  For  the  first  Part*  let 

N 


2  1  r,W  ®(yk\2 


S  =  - 

5n  n 


r/_n _ f 

l  ^N1 


N  ,9,N 


)c  =  2(1  - 


(9N’9)N 

,eN,N,e,N 


)  . 


Then  SE  <  4,  so  It  Is  enough  to  show  that  S^-^+0  to  get  ES^  — *0. 


i 


-54- 


Let 


VN  =  N 

=  ieNlJ  +  *0*N  -  2(V0)n 

=  +  2(  ®®®N^0N^N~^0N,0^N^ 

2  2 

Both  terms  are  positive,  and  since  EVjjj  — *-0,  E( II QNtl - S ell N)  — >0, 

E(lelNl6NlN-(eN,0)N)  -*0.  By  the  law  of  large  numbers  e| II ell II ell 2 
2  P 

resulting  in  SN — >-0. 

Now  look  at 


wn  =  1  £ 02(yk)Ci0i^'i[?r]2 


=  I0l?(  1 


N'leC  TeT^ 


=  (1- 


lei 


■n 

N.)2 


Then  EW„  —  0  follows  from  E| lel^-lel 


’N 


•0. 


Using  Proposition  6.19  it  follows  that  Ef0^(y;m,£)-0(y;m,£)Sm 
and  in  consequence,  that  e|<j>j  N(xJ.;m,£)-4>j.(Xj;m,'>'11 
In  function  space,  define 


p(m)  =  g .  f^e 

u  =  pvptm^ 

m  i  a 


Then 


0(*;m,£)  = 


5 

U*0n 

m  u 

lU^0nt 
m  u 


The  last  step  in  the  proof  is  showing  that 

\. 


U"e°  -e*| 


'lU£8j 
VO 


0 
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as  m,  l  go  to  infinity.  Begin  with 

(6.20)  PROPOSITION.  As  m— *•%  Um  — ►  U  in  the  uniform  operator  norm. 

PROOF.  8Um0-U0J  =  #PYTmPx9ll  <  BT,nPx0l .  Nowon  H2(Y),  BTmPxI  —*•().  If 

not,  take  9„,  lei  =  1  such  that  BTmPvemtl  >  6,  all  m.  Let  0  .  -^-*0, 
m  m  a  m  —  m 

then  Pvem-^-*-Pv0,  and 
X  m  X 

■  T01 ' pxem • 1  i  l,TJT,,px(8rn<-e)1  +  ^'Px^ 

<  BPv(0  ,-6)8  +  iT^’pyOB 
—  X  m  x 

By  Proposition  (5.21)  the  right  hand  side  goes  to  zero. 


The  operator  Um  is  not  necessarily  self-adjoint,  but  it  is 
compact.  By  (6.18),  if  0(sp(U) )  is  any  open  set  containing  sp(U), 
then  for  m  sufficiently  large  sp(Um)  C0(sp(U)).  Suppose,  for  sim¬ 
plicity,  that  the  projection  E^  corresponding  to  the  largest  eigenvalue 
X  of  U  is  one-dimensional.  (The  proof  goes  through  if  is  higher¬ 
dimensional  but  is  more  complicated.)  Then  for  any  open  neighborhood  0 


of  A,  and  m  sufficiently  large,  there  is  only  one  eigenvalue  Am  of 
Um  in  0,  Am— i ►A  and  the  projection  P^  of  Um  corresponding  to 

Am  converges  to  P£  in  the  uniform  operator  topology.  Also,  Am  can 

A 

be  taken  as  the  eigenvalue  of  Um  having  largest  absolute  value.  If 
A'  is  the  second  largest  eigenvalue  of  U,  and  A^  the  eigenvalue  of 
U_  having  the  second  highest  absolute  value,  then  (assuming  E, ,  is 

ill  A 

one-dimensional)  X '  — ► A ' . 

m 

Write 


w  =  u  -  p£m) 

m  m  E 


so  again  IW  -Wl  —*0.  Now 
m 
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(6.21) 


umen  =  xmpF^en  +  wien 
m  0  m  E  0  mu 

’  A\60  +  W\ 

A 


For  any  e  >  0  we  will  show  that  there  exists  mg, 

m  1  m0»  £  >  £g 

(6.22)  iWmeoHXm-£:  ’  «w\l|X£< 


Take  r  =  (X+X')/2  and  select  mg  such  that  r  > 
Denote  by  R(x,Wm)  the  resolvent  of  Wm-  Then 


and 


A*R(X,W_)dX 
|Xi=r  m 


B  R( X  »W_) I d | X | 
I X !  =r  m 


where  d|X| is  arc  length  along  |x|  =  r.  On  |X|  : 

lR(X,Wm)ll  is  continuous  and  bounded.  Furthermore 

uniformly.  Letting  M(r)  =  max  I1R(X,W)S,  then 

1 X  |  =r 

awf;8  <  r£H(r)(l+SA) 
m  —  mm 


where  A  6  -*0  as  m->°°.  Certainly 
mm 

IW£I  <  r£M(r)  . 


Fix  6  >  0  such  that  (l+6)r  <  X.  Take  mg  such 
m  >  max(m0,mg),  Xmj>(l+6)r.  Then 

and 

iw£l/x£  <  (^)£M(r)  . 


i g  such  that  for 


max(X\|xJ,  m>m0). 


r,  for  m  >  mg, 
HR(X,Wm)B  — ►  BR(X,W)  II 


that  for 
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Now  choose  a  new  mg  and  Jig  such  that  (5.6)  is  satisfied. 
Using  (5.6) 


UmeO 
ro  u 


PE  60 
m 

m 


=  e, 


m,Jl 


where  e, 


"0  as  m,Jl  — *•«>.  Thus 


PE  60 
m 


V° 


''“mV 


8*’  ■  cm.t  +  IIP.  O "Ip.  O 

fcm  u  u 


and  the  right  side  goes  to  zero  as  m,Jl— 

The  term  weak  consistency  is  used  above  because  we  have  in  mind  a 
desirable  stronger  result.  We  conjecture  that  for  reasonable  smooths,  the 

set  cn  =  ( ( Yl *- 1 ) . ^YN’-n);  al9°rithn>  converges)  satisfies  P(CN)-*1 

and  that  for  0^  the  limit  on  CN  starting  from  a  fixed  6g, 

ECicNieN-e*ip]-o  . 

We  also  conjecture  that  such  a  theorem  will  be  difficult  to  prove. 
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APPENDIX  1 

Most  Reasonable  Sequences  of  Uniformly  Bounded  Smooths  are  M.S.  Consistent 

If  the  window  size  goes  to  zero  at  the  right  rate  as  N  —*■<»,  then 
most  "reasonable"  smooths  which  utilize  local  smoothing  are  m.s.  consis¬ 
tent.  There  is  a  substantial  literature  on  consistency,  usually  in 
higher  dimensional  spaces.  Stone's  pioneering  paper  [1977]  established 
consistency  for  k-nearest  neighbor  smoothing.  Devroye  and  Wagner  [1980] 
and  independently  Spiegelman  and  Sacks  [1980]  gave  weak  conditions  for 
consistency  of  kernel  smooths.  See  Stone  [1977]  and  Devroye  [1981]  for  a 
review  of  the  literature. 

The  coirmon  definition  of  consistency  is:  given  a  set  of  N-l  inde¬ 
pendent  copies  (X,,Yj),...,(Xn_j,Yn_j),  of  (X,Y)  drawn  from  the  same 
bivariate  distribution,  and  d>  e  L2( Y) ,  call  S^  L^-consi stent  if 
E[S^(X)-E(<t>(Y)  |X)]2  — *  0.  To  see  that  our  definition  is  equivalent,  put 
down  the  fixed  point  (x,y)  and  then  the  other  N-l  random  points 
(x1,y1),...,(xN1,yN_1).  Now  compute  E[S^(x)-E(Y|x)]2  =  gN(x) .  Our 
definition  of  m.s.  consistency  is  then 

or  EgN(X)— *-0. 

Uniform  boundedness  is  a  critical  condition  for  consistency  proofs. 

A  key  element  in  Stone's  proof  Is  {put  in  different  form) 

(A.l  )  PROPOSITION.  Take  data  sets  drawn  from  a  bivariate  distribution 

/  |U\ 

(X,Y),  S'  a  uniformly  bounded  sequence  of  smooths  on  X,  P^  the 
conditional  expectation  operator.  If,  for  a  set  of  functions  {<f>} 

dense  in  HgfY), 


J 
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E|S(N)(<Hx)-Px(<f>|x)|*  —  0  , 

(N) 

then  the  S  as  m.s.  consistent.  If ,  for  a  set  of  functions  {h} 
dense  in  H2(X), 

(A. 2  )  EflS(N)(hix)-h(x)|^  -v  0 

then  (A. 2  )  holds  for  all  h  €  H2(X). 

The  proof  is  simple  and  is  omitted. 

(N) 

Assume  S'  '  is  linear.  Then 

(A. 3  )  EBS(N)4>-Pxd>H^  £  2EIIS(N)(<j,-Px4,)|^  +  2EB S ( N) Px4>-Pxd>B ^  . 

If  it  can  be  shown  that  EBS^N^h-htt^  — ^0  for  all  continuous  h  e  h2(X) 
vanishing  off  at  finite  intervals,  and  if  the  first  term  on  the  right 
in  (A. 3  )  goes  to  zero  for  all  <j>  such  that  <  °°,  then  (A.l  ) 

implies  that  S'  is  m.s.  consistent.  This  strategy  works  for  a  wide 
variety  of  smooths. 

To  illustrate,  because  Stone's  results  [1977]  do  not  seem  immediately 
applicable  to  bivariate  regression  smooths,  m.s.  consistency  is  proven 
for  a  modified  regression  smooth  similar  to  supersmoother.  For  x  any 
point,  let  J(x)  be  the  indices  of  the  M  points  in  { x^}  directly 
above  x  plus  the  M  below.  If  there  are  only  M1  <  M  above  (below) 
then  include  the  M+(M-M')  directly  below  (above).  For  a  regression 
smooth 

,  .  -  rvU.x) 

(  A. 4  )  S(<t>| x)  =  $x  +  - — (*-*x) 


where  *  xx  are  the  averages  of  <t>(yk),  over  the  indices  in  J(x), 
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o 

r  ($,x),  ox  the  covariance  between  ^(y^),  and  the  variance  of  xk 
over  the  indices  in  J(x). 

Write  the  second  term  in  (A. 3  )  as 


rx(<j>,x)  (x-xx) 


If  there  are  M  points  above  and  below  in  J(x),  it  is  not  hard  to  show 
that 


This  is  not  true  near  endpoints  where  (x-xx)/°x  can  become  arbitrarily 
large  as  M  gets  large.  This  endpoint  behavior  keeps  regression  from 
being  uniformly  bounded.  To  remedy  this,  define  a  function 


f  x  .  M  <  1 

[  sgn(s)  ,  jx)  >  1  , 


and  define  the  modified  regression  smooth  by 

r„(<M)  x-x 

(A. 5  )  S(4> | x)  =  <t>x  +  "g—  [-^3 1 

This  modified  smooth  is  bounded  by  2. 

(  A. 6  )  THEOREM.  If,  as  N —*•<»,  M  — *■<», M  log  N/N— >0  and  P(dx)  has  no 
atoms,  then  the  modified  regression  smooths  are  m.s.  consistent. 

PROOF.  Assume  <  «  and  use  the  inequality  (A. 3  )  with  g( x)  - 

PxU|x).  Then 

s(<i>-g|x)  =  ^  Ejei(x)^V"9*xJ^{1  +  V  ' 
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2 

The  conditional  expectation  of  [SU-gjx)]  given  {x^i  is 

* 


1 

17 


Thus,  the  first  term  in  (A. 3  )  is  asymptotically  zero.  Now  look  at 
S ( h ) x)  -h(x),  h  e  ^(X),  h  continuous  and  zero  outside  a  finite 
interval ; 


S(h|x)  -h(x)  =  gjf  Ij&J(x)(Mxj)-Mx))(l  +  X  l  a 


Then,  for  H(6)  =  max{ jh(x' )-h(x") j ;  |x'-x"l  <<$}> 

[S(h|x)-h(x)]2  <  [j6J(x)(h(xj)-h(x))2]*2 

<  2H(  max  |x.-x|)  . 
jej(x)  0 

2 

Then  to  get  E[S(h|x)-h(x)]  ~*0,  it  is  enough  to  show  that  A 
An  =  max{|Xj-x|;  Xj€J(x)}  converges  in  probability  to  zero.  Take  x 
to  be  a  point  such  that  P[(x,x+e)]  >  0,  P[(x-e,x)]  >  0  for  all  e  >  0 
The  set  S  of  all  such  points  has  P(S)  =  1.  Then 


{&N  >e}  C  {at  most  2M-1  of  {x^}  in  (x-e,x)} 

u  {at  most  2M-1  of  {x^}  in  (x,x+e)}  . 

Using  the  Binomial  distribution  gives  the  bound 

P(An>£)  <  2NM{(1  -P[(x,x+c)])N'M  +  (l  -P[(x-e,x)])N‘M> 

Holding  e  fixed  with  N— *•«  and  M  =  o(N/log  N)  results  in 
P(A^>e)-*0,  proving  the  theorem. 


-62- 


References 

Anscombe,  F.J.  and  Tukey,  J.W.  (1963).  The  examination  and  analysis  of 
residuals.  Techmmetrics  5,  141-160. 

Belsey,  D.A.,  Kuh,  E.  and  Welsch,  R.E.  (1980).  Regression  Diagnostics, 
John  Wiley  and  Sons. 

Box,  G.E.P.  and  Tidwell,  P.W.  (1962).  Transformations  of  the  independent 
variables.  Technometrias  4,  531-550. 

Box,  G.E.P.  and  Cox,  D.R.  (1964).  An  analysis  of  transformations.  J.R. 
Statist.  Soc.  B  26,  211-252. 

Box,  G.E.P.  and  Hill,  W.J.  (1974).  Correcting  inhomogeneity  of  variance 
with  power  transformation  weighting.  Technometrics  16,  385-389. 

Cleveland,  W.S.  (1979).  Robust  locally  weighted  regression  and  smoothing 
scatterplots.  J.  Amer.  Statist.  Assoc.  74,  828-836. 

Craven,  P.  and  Wahba,  G.  (1979).  Smoothing  noisy  data  with  spline 

functions.  Estimating  the  correct  degree  of  smoothing  by  the  method 
of  generalized  cross-validation.  Numerische  Mathematik  31,  317-403. 

deBoor,  C.  (1978).  A  Practical  Guide  to  Splines,  Springer-Verlag. 

Devroye,  L.  (1981).  On  the  almost  everywhere  convergence  of  nonparametric 
regression  function  estimates.  Ann.  Statist.  9,  1310-1319. 

Devroye,  L.  and  Wagner,  T.J.  (1980).  Distribution-free  consistency 
results  in  nonparametric  discrimination  and  regression  function 
estimation.  Ann.  Statist.  8,  231-239. 

Draper,  N.R.  and  Cox,  D.R.  (1969).  On  distributions  and  their  transfor¬ 
mations  to  normality.  J.R.  Statist.  Soc.  B  31,  472-476. 

Efron,  B.  (1979).  Bootstrap  methods:  another  look  at  the  jackknife. 

Ann.  Statist.  7,  1-26. 


-63- 


Fraser,  D.A.S.  (1967).  Data  transformations  and  the  linear  model.  Ann. 
Math.  Statist.  38,  1456-1465. 

Friedman,  J.H.  and  Stuetzle,  W.  (1982).  Smoothing  of  scatterplots.  Dept. 

of  Statistics,  Stanford  University,  Tech.  Report  0RI0N006. 

Gasser,  T.  and  Rosenblatt,  M.  (eds.)  (1979).  Smoothing  Techniques  for 
Curve  Estimation ,  in  Lecture  Notes  in  Mathematics  757,  New  York: 
Springer-Verlag. 

Gebelein,  H.  (1941).  Das  statitistiche  problem  der  korrelation  als 
variations  und  eigenwert  problem  und  sein  Zusammenhang  mit  der 
Ausgl eichung-srechnung .  Z.  Angew.  Math.  Mech.  21,  364-379. 

Harrison,  D.  and  Rubinfeld,  D.L.  (1978).  Hedonic  housing  prices  and  the 
demand  for  clean  air.  J.  Environ.  Econ.  Mngmnt  5,  81-102. 

Kendall,  M.A.  and  Stuart,  A.  (1967).  The  Advanced  Theory  of  Statistics , 
Volume  2,  Hafner. 

Kruskal,  J.B.  (1964).  Nonmetric  multidimensional  scaling:  a  numerical 
method.  Psychometrika  29,  115-129. 

Kruskal,  J.B.  (1965).  Analysis  of  factorial  experiments  by  estimating 
monotone  transformations  of  the  data.  J.B.  Statist.  Soc.  B  27, 
251-263. 

Linsey,  J.K.  (1972).  Fitting  response  surfaces  with  power  transformations. 
J.B.  Statist.  Soc.  C  21,  234-237. 

Linsey,  J.K.  (1974).  Construction  and  comparison  of  statistical  models. 
J.B.  Statist.  Soc.  B  36,  418-425. 

Mosteller,  F.  and  Tukey,  J.W.  (1977).  Data  Analysis  and  Begression, 
Addlson-Wesley. 

Renyi,  A.  (1959).  On  measures  of  dependence.  Acta.  Math.  Acad.  Sci. 
Hungar.  10,  441-451. 


-64- 


Sarmanov,  O.V.  (1958a).  The  maximal  correlation  coefficient  (symmetric 
case).  Dokl.  Acad .  Nauk.  SSSR  120,  715-718. 

Sarmanov,  O.V.  (1958b).  The  maximal  correlation  coefficient  (nonsymmetric 
case).  Dokl.  Acad.  Nauk.  SSSR  121,  52-55. 

Spiegelman,  C.  and  Sacks,  J.  (1980).  Consistent  window  estimation  in 
nonparametric  regression.  Arm.  Statist.  8,  240-246. 

Stone,  C.J.  (1977).  Consistent  nonparametric  regression.  Arm.  Statist.  7, 
139-149. 

Tukey,  J.W.  (1982).  The  use  of  smelting  in  guiding  re-expression,  in 
Modem  Data  Analysis ,  Laurner  and  Siegel  (eds.),  Academic  Press. 

Wood,  J.T.  (1974).  An  extension  of  the  analysis  of  transformations  of 
Box  and  Cox.  Appl.  Statist.  ( J.R .  Statist.  Soa.  C )  23. 


-65- 


Appendix  2 

FORTRAN  Implementation  of  ACE  Algorithm 

This  section  presents  a  listing  of  a  FORTRAN  program  implementing  the 
ACE  algorithm.  It  consists  of  six  subroutines.  The  first  two  are  called 
by  the  user  to  perform  functions  (ACE .ACEMOD) ,  the  next  is  a  BLOCK  DATA 
subprogram  in  which  the  values  of  several  internal  parameters  are  initialized, 
and  the  last  three  provide  utilities  used  by  the  first  two  subroutines.  The 
user  interface  is  through  FORTRAN  subroutine  calls.  The  data  and  various 
input  parameters  ar-  supplied  as  arguments  in  the  calling  sequence.  The 
optimal  transformation  estimates  and  other  output  quantities,  as  well  as 
required  scratch  storage  are  also  passed  as  subroutine  arguments.  In  order 
to  employ  these  routines  it  is  necessary  to  write  a  main  or  calling  program 
that  reads  the  input  data  into  appropriate  arrays,  to  declare  additional 
arrays  for  output  and  scratch  storage,  and  then  to  execute  a  call  to  the 
appropriate  subroutine  (ACE  or  ACEMOD). 

SUBROUTINE  ACE  computes  the  optimal  transformation  estimates  using  the 
double  loop  restart  version  of  the  ACE  algorithm  (see  Sections  2  and  5.3). 
These  estimates  are  stored  as  a  set  of  transformed  values,  one  value  for 
each  observation,  for  the  response  and  each  predictor  variable.  Upon  return 
from  SUBROUTINE  ACE  these  values  are  stored  in  the  arrays  specified  in  the 
subroutine  call.  This  information  can  then  be  passed  to  appropriate  graphics 
routines  for  display,  or  to  function  approximation  routines  for  summarization. 
SUBROUTINE  ACEMOD  can  (optionally)  be  called  to  estimate  new  response  values 
given  a  set  of  predictor  values  and  the  optimal  transformation  estimates  from 
SUBROUTINE  ACE. 


Appendix  2  -  Page  2 


-66- 


As  written,  this  program  is  intended  to  be  used  in  conjunction  with  a 
particular  smoothing  subroutine  ("super-smoother")  which  is  listed  in 
Friedman  and  Stuetzle  (1982).  Our  experience  so  far  with  the  ACE  procedure 
has  been  gained  in  this  context.  It  is  possible  to  employ  other  smoothing 
routines  by  properly  interfacing  them  with  the  ACE  code.  ACE  calls  for  the 
smooth  by  the  FORTRAN  statement 

CALL  SUPSMU  (N,X,Y,U,L, ALPHA, RESPAN, IBIN,SMO,SCR) 
where  the  parameters  have  the  following  meaning: 

N:  number  of  observations  (X,Y)  pairs. 

X(1...N):  ordered  abscissa  values. 

Y(1...N):  corresponding  ordinate  values. 

W(1...N):  weight  for  each  observation. 

L:  abscissa  variable  flag  -  L  =  1  ordered  variable. 

L  =  2  periodic  (circular)  variable. 

ALPHA, RESPAN, IB IN :  miscellaneous  parameters  (can  be  set  through 

COMMON/PARMS/) . 

$M0(1...N):  output  smoothed  ordinate  values. 

SCR(1. . .N,l. . .3) :  scratch  array. 
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SUBROUTINE  ACE  ( P, N, X, Y, W, L, DELRSQ, TX, TY, RSQ, IERR.M, Z) 


ESTIMATE  OPTIMAL  TRANSFORMATIONS  FOR  MULTIPLE  REGRESSION  AND 
CORRELATION  BY  ALTERNATING  CONDITIONAL  EXPECTATION  ESTIMATES. 

(7/17/82) 

( BREIMAN  AND  FRIEDMAN,  1982) 

CODED  BY:  J.  H.  FRIEDMAN 

DEPARTMENT  OF  STATISTICS  AND 
LINEAR  ACCELERATOR  CENTER 
STANFORD  UNIVERSITY 
STANFORD,  CA .  94305 

INPUT: 

N  :  NUMBER  OF  OBSERVATIONS. 

P  :  NUMBER  OF  PREDICTOR  VARIABLES  FOR  EACH  OBSERVATION. 

X ( P , N )  :  PREDICTOR  DATA  MATRIX. 

Y  ( N )  :  RESPONSE  VALUE  FOR  EACH  OBSERVATION. 

W(N)  :  WEIGHT  FOR  EACH  OBSERVATION. 

L(P+1 )  :  FLAG  FOR  EACH  VARIABLE. 

L ( 1 )  THROUGH  L(P)  :  PREDICTOR  VARIABLES. 

L ( P+1 )  :  RESPONSE  VARIABLE. 

L ( I ) =0  =>  ITH  VARIABLE  NOT  TO  BE  USED. 

L( I )=1  =  >  ITH  VARIABLE  ASSUMES  ORDERED  VALUES. 

L(  I )  =  2  =  >  ITH  VARIABLE  ASSUMES  CIRCULAR  (PERIODIC)  VALUES. 
L ( I ) =3  =  >  ITH  VARIABLE  TRANSFORMATION  IS  TO  BE  MONOTONE. 
L(I)-4  =>  ITH  VARIABLE  TRANSFORMATION  IS  TO  BE  LINEAR. 

L( I ) =5  =>  ITH  VARIABLE  ASSUMES  CATEGORICAL  VALUES. 

DELRSQ  :  TERMINATION  THRESHOLD.  ITERATION  STOPS  WHEN 
RSQ  CHANGES  LESS  THAN  DELRSQ  IN  NTERM 

CONSECUTIVE  ITERATIONS  (SEE  BELOW  -  DEFAULT,  NTERM=3 ) . 
OUTPUT: 

TX (N , P )  :  PREDICTOR  TRANSFORMATIONS . 

TX  ( J,  I )  =  TRANSFORMED  VALUE  OF  ITH  PREDICTOR  FOR  JTH  OBS  . 
TY(N)  =  RESPONSE  TRANSFORMATION. 

TY  ( J  )  =  TRANSFORMED  RESPONSE  VALUE  FOR  JTH  OBSERV  i'-’ION  . 

RSQ  =  FRACTION  OF  VARIANCE ( TY<Y> ) 

P 

EXPLAINED  BY  SUM  TX(I)<X(I)>  . 

1  =  1 

IEP.R  :  ERROR  FLAG. 

I ERR  =  0  :  NO  ERRORS  DETECTED. 

IERR  >  0  :  ERROR  DETECTED  -  SEE  FORMAT  STATEMENTS  BELOW. 
SCRATCH: 


M( N, P+1 ) ,  Z ( N, 7 )  :  INTERNAL  WORKING  STORAGE. 

(Z(J,1),  J=1,N)  CONTAIN  (TRANSFORMED)  RESIDUALS  AS  OUTPUT. 
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C  NOTE:  THIS  ROUTINE  USES  AS  A  PRIMITIVE  THE  'SUPER  SMOOTHER' 
C  SUPSMU  (SEE  -  FRIEDMAN  AND  STUETZLE  (1982).  SMOOTHING  OF 
C  SCATTERPLOTS .  STANFORD  UNIVERSITY  STATISTICS  DEPARTMENT 

C  REPORT  ORION006. ) 

C 

C - 

INTEGER  P , PP 1,M(N,1),L(1) 

REAL  Y(N),X(P,N),W(N),TY(N),TX(N,P),Z(N,7) ,CT(10) 

COMMON  /PARMS/  ITAPE,MAXIT, NTERM, ALPHA, RESPAN, IBIN 

DOUBLE  PRECISION  SM,SV,SW 

IERR— 0 

PP1=P+1 

SM=0 . 0 

SV=SM 

SW=SV 

IF  (L(PP1) .GT.O)  GO  TO  10 
IERR=4 

IF  (ITAPE.GT.O)  WRITE  (ITAPE.360)  PP1 
RETURN 
10  NP=0 

DO  20  1=1, P 
IF  (L(I).GT.O)  NP=NP+1 
20  CONTINUE 

IF  (NP.GT.O)  GO  TO  30 
IERR=5 

IF  (ITAPE.GT.O)  WRITE  (ITAPE,370)  P 
RETURN 

30  DO  40  J=1,N 

SM=SM+W(J)*Y(J) 

SV=SV+W( J)*Y(J)**2 
SW=SW+W(J) 

M(  J, PPI )=J 
Z  ( J ,  2 ) =Y ( J ) 

40  CONTINUE 

IF  (SW.GT.0.0)  GO  TO  50 
IERR=1 

IF  (ITAPE.GT.O)  WRITE  (ITAPE,330) 

RETURN 

50  SM=SM/SW 

SV=SV/SW-SM**2 

IF  (SV.LE.O.O)  GO  TO  60 

SV=1.0/DSQRT(SV) 

GO  TO  70 
60  IERR=2 

IF  (ITAPE.GT.O)  WRITE  (ITAPE.340) 

RETURN 

70  DO  80  J=1,N 

Z(J,1)=(Y(J)-SM)*SV 
80  CONTINUE 

CALL  SORT  (Z(1,2),M(1,PP1),1,N) 

DO  100  1=1, P 

IF  (L(I).LE.O)  GO  TO  100 
DO  90  J=1,N 
TX( J, I )=0. 0 
M(J, I )»J 
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Z ( J, 2  ) =X ( I ,  J  ) 

90  CONTINUE 

CALL  SORT  (Z(1,2),M(1,I),1,N) 

100  CONTINUE 
RSQ=0 . 0 
ITER=0 

NTERM=MINO (NTERM, 10) 

NT=0 

DO  110  1=1, NTERM 
CT ( I )=100 . 0 
110  CONTINUE 
120  ITER=ITER+1 
DO  130  J=1 , N 
TY ( J )=Z ( J, 1 ) 

130  CONTINUE 
NIT=0 

140  RSQI=RSQ 
NIT=NIT+1 
DO  200  1=1, P 

IF  (L(I) .LE.O)  GO  TO  200 
DO  150  J=1,N 
K=M( J, I) 

Z ( J , 1 ) =TY ( K ) +TX ( K , I ) 

Z ( J , 2 ) =X ( I , K ) 

Z ( J , 7 ) =W ( K ) 

150  CONTINUE 

CALL  SMOTHR  (L(I),N,Z(1,2),Z,Z{1,7),Z(1,6),Z(1,3)) 

SM=0 -  0 

DO  160  J=1 , N 
SM=SM+Z (J,7)*Z(J,6) 

160  CONTINUE 
SM=SM/SW 
DO  170  J=1 , N 
Z(J,6)=Z(J,6)-SM 
170  CONTINUE 
SV=0.0 

DO  180  J=1,N 

SV=SV+Z (J, 7 ) * ( Z ( J , 1 )-Z(J, 6) )**2 
180  CONTINUE 

SV=1 . o-sv/sw 

IF  (SV.LE.RSQ)  GO  TO  200 

RSQ=SV 

DO  190  J=1 , N 

K=M ( J, I ) 

TX(K, I )=Z ( J, 6  ) 

TY ( K ) =Z ( J , 1 ) -Z ( J , 6 ) 

190  CONTINUE 
200  CONTINUE 

IF  ( (NP . NE . 1 ) . AND . ( ( RSQ-RSQI . GT . DELRSQ) • AND. (NIT. LT . MAXIT) ) )  GOTO 
1  140 

IF  (RSQ.NE.O.O.OR.ITER.NE.l)  GO  TO  230 
DO  220  1=1, P 

IF  (L(I) .LE.O)  GO  TO  220 
DO  210  J=1 , N 
TX( J , I )=X ( I , J ) 
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210  CONTINUE 
220  CONTINUE 
230  DO  250  J=1,N 
K=M( J, PP1 ) 

Z  ( J ,  2  ) =Y { K ) 

Z(J,7)=W(K) 

Z(J,1)=0.0 
DO  240  1=1, P 

IF  (L(I).GT.O)  Z(J, 1)=Z(J, 1) +TX ( K , I ) 

240  CONTINUE 
250  CONTINUE 

CALL  SMOTHR  (L( PP1 ) .  N,  Z ( 1 , 2 ) ,  Z,  Z ( 1 , 7  )  ,  Z ( 1 , 6 ) ,  Z ( 1 , 3 ) ) 

SM=0 . 0 

SV=SM 

DO  260  J=1 , N 
K=M( J, PP1 ) 

SM=SM+W(K)*Z( J, 6) 

SV=SV+W(K)*Z(J,6)**2 
Z(K, 2)=Z( J , 1 ) 

260  CONTINUE 
SM=SM/SW 
SV=SV/SW-SM**2 
IF  (SV.LE.0.0)  GO  TO  270 
SV=1.0/DSQRT(SV) 

GO  TO  280 
270  IERR=3 

IF  (ITAPE.GT.O)  WRITE  (ITAPE,350) 

RETURN 

280  DO  290  J=1,N 
K=M( J, PP1 ) 

TY(K)=(Z(J,6)~SM)*SV 
290  CONTINUE 
SV=0 . 0 

DO  300  J=1.N 
Z(J, 1)=TY(J)-Z(J, 2) 

SV=SV+W(J)*Z(J,1)**2 
300  CONTINUE 

RSQ=1.0-SV/SW 

IF  (ITAPE.GT.O)  WRITE  (ITAPE.320)  ITER, RSU 

NT=MOD (NT, NTERM) +1 

CT(NT)=RSQ 

CMN=100.0 

CMX=-100 . 0 

DO  310  1=1, NTERM 

CMN=AMIN1 (CMN , CT ( I ) ) 

CMX=AMAX1 ( CMX, CT ( I ) ) 

310  CONTINUE 

IF  ( (CMX-CMN.GT.DELRSQ) .AND. (ITER. LT.MAXIT) )  GO  TO  120 


320 

RETURN 
FORMAT ( 

11H 

ITERATION  12, 

23H  R**2  = 

1  -  E**2  =G 1 2 . 4 ) 

330 

FORMAT ( 

41H 

IERR=1: 

SUM  OF 

WEIGHTS  (W)  NOT 

POSITIVE. ) 

340 

FORMAT ( 

29H 

IERR=2: 

Y  HAS 

ZERO  VARIANCE.) 

350 

FORMAT  ( 

30H 

IERR=3: 

TY  HAS 

ZERO  VARIANCE.) 

360 

FORMAT ( 

11H 

IERR=4: 

L{  12, 

19H )  MUST  BE  POSITIVE.) 

370 

FORMAT ( 

29H 

IERR=5: 

AT  LEAST  ONE  L(1)-L(I2, 

19H )  MUST  BE  POSIT 
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1 IVE • ) 

END 

SUBROUTINE  ACEMOD  (V, P, N, X, Y, L, TX, TY, M, YHAT, IERR) 

C - 

C 

C  COMPUTES  RESPONSE  ESTIMATES  FROM  THE  MODEL 
C  -IP 

C  YHAT  =  TY  (  SUM  TX(I)<V(I)>  ) 

C  1  =  1 

C  USING  THE  TRANSFORMATIONS  CONSTRUCTED  BY  SUBROUTINE  ACE. 

C 

C  INPUT: 

C 

C  V(P)  :  VECTOR  OF  PREDICTOR  VALUES. 

C  P,  N ,  X,  Y ,  L  :  SAME  INPUT  AS  FOR  SUBROUTINE  ACE. 

C  TX, TY, M  :  OUTPUT  FROM  SUBROUTINE  ACE. 

C 

C  OUTPUT: 

C 

C  YHAT  :  ESTIMATED  RESPONSE  VALUE  FOR  V. 

C  IERR  :  ERROR  FLAG. 

C  IERR=0:  NO  ERROR  DETECTED. 

C  IERR=I:  ERROR  DETECTED  -  SEE  FORMAT  STATEMENT  BELOW. 

C 

C  NOTE:  THIS  ROUTINE  REQUIRES  THAT  THE  RESPONSE  TRANSFORMATION  TY  IS  A 
C  STRICTLY  (INCREASING  OR  DECREASING)  MONOTONE  FUNCTION  OF  Y,  THAT 
C  IS  L ( P+1 )  =  3  OR  4  IN  THE  CALL  TO  SUBROUTINE  ACE. 

C 

C - 

INTEGER  P, PP1,M(N, 1) , L(1 ) , LOW, HIGH, PLACE 
REAL  V(P) ,X(P,N) ,Y(N) , TY ( N ) , TX ( N , P ) 

COMMON  /PARMS/  ITAPE , MAXIT , NTERM, ALPHA, RESPAN , IBIN 

PP1=P+1 

IERR=0 

IF  (L(PP1) .EQ.3. OR. L( PP1 ) . EQ. 4 )  GO  TO  10 
IERR=1 

IF  (ITAPE. GT.O)  WRITE  (ITAPE, 140)  PP1 
RETURN 
10  YH=0 . 0 

DO  80  1=1, P 

IF  ( L ( I ) . LE . 0 )  GO  TO  80 
VI=V(I) 

IF  (VI .GT.X(I,M(1, I) ) )  GO  TO  20 

PLACE=1 

GO  TO  70 

20  IF  (VI.LT.X(I,M(N, I) ) )  GO  TO  30 
PLACE=N 
GO  TO  70 
30  LOW=0 

HIGH=N+1 

40  IF  ( LOW+1 . GE . HIGH )  GO  TO  60 
PLACE* ( LOW+HIGH ) / 2 
XT=X( I , M( PLACE, I ) ) 

IF  (VI.EQ.XT)  GO  TO  70 
IF  (VI.GE.XT)  GO  TO  50 


nnnnnnnnnonnnfin 
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HIGH=PLACE 
GO  TO  40 
50  LOW= PLACE 

GO  TO  40 
60  JL=M(LOW, I) 

JH=*M(  HIGH, I ) 

YH=YH+TX( JL, I )+ (TX( JH, I ) -TX( JL, I ) ) * ( VI-X ( I , JL) ) / (X ( I , JH ) -X ( I , JL) ) 
GO  TO  80 

70  YH=YH+TX(M{ PLACE, I) , I) 

80  CONTINUE 

IF  ( YH . GT . TY ( M( 1 , PP1 ) ) )  GO  TO  90 
YHAT=Y ( M { 1 , PP1 ) ) 

RETURN 

90  IF  ( YH . LT . TY (M(N , PP1 ) ) )  GO  TO  100 
YHAT=Y(M(N,  PPl ) ) 

RETURN 
100  LOW=0 

HIGH=N+1 

XT=TY(M(N,PP1) )-TY(M(l, PPl) ) 

XT=XT/ABS(XT) 

110  IF  ( LOW+1 . GE . HIGH )  GO  TO  130 
PLACE= ( LOW+HIGH ) / 2 

IF  ( XT*YH . GE . XT*TY(M( PLACE, PPl ) ) )  GO  TO  120 
HIGH=PLACE 
GO  TO  110 
120  LOW=PLACE 
GO  TO  110 
130  JL=M(L0W,PP1) 

JH=M( HIGH, PPl ) 

YHAT=Y ( JL) +( Y( JH)- Y( JL) }*(YH-TY( JL) ) / (TY( JH ) -TY( JL) ) 

RETURN 

140  FORMAT (  11H  IERR=1 :  L(I2,  55H)  MUST  EQUAL  3  OR  4  -  MONOTONE 

1 RESPONSE  TRANSFORMATION.) 

END 

BLOCK  DATA 

COMMON  /PARMS/  ITAPE, MAXIT, NTERM, ALPHA, RESPAN, IBIN 


THESE  PROCEDURE  PARAMETERS  CAN  BE  CHANGED  IN  THE  CALLING  ROUTINE 
BY  DEFINING  THE  ABOVE  LABELED  COMMON  AND  RESETTING  THE  VALUES  WITH 
EXECUTABLE  STATEMENTS. 

ITAPE  :  FORTRAN  FILE  NUMBER  FOR  PRINTER  OUTPUT. 

( ITAPE. LE.O  =>  NO  PRINTER  OUTPUT.) 

MAXIT  :  MAXIMUM  NUMBER  OF  ITERATIONS. 

NTERM  :  NUMBER  OF  CONSECUTIVE  ITERATIONS  FOR  WHICH 

RSQ  MUST  CHANGE  LESS  THAN  DELCOR  FOR  CONVERGENCE. 

ALPHA,  RESPAN,  IBIN  :  SUPER  SMOOTHER  PARAMETERS. 

(SEE  -  FRIEDMAN  AND  STUETZLE,  REFERENCE  ABOVE.) 


DATA  ITAPE, MAXIT, NTERM, ALPHA, RESPAN, IBIN  /6 , 20 , 3 , 0 . 1 , 0 . 2 5 , 1 / 
END 

SUBROUTINE  SMOTHR  (L, N, X, Y, W, SMO, SCR) 

REAL  X(N) ,Y(N) ,W(N) , SMO(N) , SCR(N, 3 ) 
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common  /PARMS/  ITAPE,MAXIT,NTERM, ALPHA, RESPAN, IBIN 
DOUBLE  PRECISION  SM,SW,A,B,D 
IF  (L. LT. 5 )  GO  TO  50 
J«1 

10  J  0s*  J 

SM*W{ J )*Y(J) 

SW*W(J) 

IF  (J.GE.N)  GO  TO  30 
20  IF  (X(J  +  1) .GT.X(J)  )  GO  TO  30 
J=J  +  1 

SM=SM+W(J)*Y(J) 

SW=SW+W( J ) 

IF  (JT.LT.N)  GO  TO  20 
30  SM— SM/SW 

DO  40  I=J0,J 
SMO(I )=SM 
40  CONTINUE 
J=J  +  1 

IF  (J.LE.N)  GO  TO  10 
GO  TO  240 

50  IF  (L.NE.4)  GO  TO  80 
SM=0  -  0 
SW=SM 
B=SW 
D=B 

DO  60  J=1,N 

SM=SM+W(J)*X(J)*Y(J) 

SW=SW+W( J)*X(J)**2 

B=B+W ( J ) *X ( J ) 

D=D+W( J) 

60  CONTINUE 

A=SM/(SW-(B**2)/D) 

b=b/d 

DO  70  J  =  1 ,  N 
SMO(J)=A* (X(J)-B) 

70  CONTINUE 
GO  TO  240 

80  CALL  SUPSMU  ( N, X, Y, Wf L, ALPHA, RESPAN, IBIN, SMO, SCR ) 
IF  (L.NE.3)  GO  TO  240 
DO  90  J=1,N 
SCR( J, 1 )=SMO(J) 

SCR (N-J+l ,2) =SCR ( J , 1 ) 

90  CONTINUE 

CALL  MONTNE  (SCR.N) 

CALL  MONTNE  (SCR(1,2)#N) 

SM=0 . 0 
SW=SM 

DO  100  J=1,N 

SM=SM+(SMO(J)-SCR(J, 1) )**2 
SW=SW+ ( SMO ( J ) -SCR(N-J+1 «  2 ) )**2 
100  CONTINUE 

IF  (SM.GE.SW)  GO  TO  120 
DO  110  J»1,N 
SMO(J)=SCR(J, 1) 

110  CONTINUE 
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GO  TO  140 
DO  130  J»1,N 
SMO(J)=SCR{N— J+1,2) 

CONTINUE 
J  =  1 
J  0  *0 

IF  (J.GE.N)  GO  TO  170 

IF  (SMO(j+l) .NE.SMO(J) )  GO  TO  170 

J=J+1 

IF  (J.LT.F)  GO  TO  160 
IF  (J.LE.JO)  GO  TO  190 
A=0 . 0 

IF  (J0.GT.1)  A=0. 5* (SMO( JO)-SMO(JO-1 ) ) 
B=0 . 0 

IF  (J.LT.N)  B=0. 5*(SMO(J+l )-SMO(J ) ) 
D=(A+B)/(J-J0) 

IF  (A.EQ.O.O.OR.B.EQ.O.O)  D=2.0*D 

IF  (A.EQ.0.0)  A=B 

DO  180  I=J0,J 

SMO ( I ) =SMO ( I ) -A+D* { I- JO ) 

CONTINUE 

J=J+1 

IF  (J.LE.N)  GO  TO  150 

J=1 

J0=J 

SM=>SMO(J) 

IF  (J.GE.N)  GO  TO  220 

IF  (X(J+1) .GT.X(J) )  GO  TO  220 

J=J+1 

SM=SM+SMO(J) 

IF  (J.LT.N)  GO  TO  210 
sm=sm/ ( J-JO+1 ) 

DO  230  I=J0,J 
SMO ( I ) =SM 
CONTINUE 
J=J+1 

IF  (J.LE.N)  GO  TO  200 

RETURN 

END 

SUBROUTINE  MONTNE  (X,N) 

REAL  X ( N ) 

INTEGER  BB, EB, BR, ER, BL, EL 

BB«0 

EB=BB 

IF  (EB.GE.N)  GO  TO  110 

BB«EB+1 

EB«BB 

IF  (EB.GE.N)  GO  TO  30 
IF  (X(BB) .NE.X(EB+1) )  GO  TO  30 
EB»EB+1 
GO  TO  20 

IF  (EB.GE.N)  GO  TO  70 

IF  (X ( EB ) . LE . X ( EB+1 ) )  GO  TO  70 

BR-EB+1 

ER-BR 
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40  IP  (ER.GE.N)  GO  TO  50 

IF  (X(ER+1 ) .NE.X(BR) )  GO  TO  50 

ER=ER+1 

GO  TO  40 

50  PMNa (X ( BB ) * ( EB-BB+1 ) +X ( BR) * ( ER-BR+1 ) )/(ER-BB+l) 
EB=ER 

DO  60  I=BB , EB 
X(I )=PMN 
60  CONTINUE 
70  IF  (BB.LE.l)  GO  TO  10 

IF  (X(BB-l) .LE.X(BB) )  GO  TO  10 

BL=BB-1 

EL=BL 

80  IF  (BL.LE.l)  GO  TO  90 

IF  (X(BL-l) .NE.X(EL) )  GO  TO  90 

BL=BL-1 

GO  TO  80 

90  PMN=(X( BB) * ( EB-BB+1 ) +X( BL) * ( EL-BL+1 ) )/(EB-BL+l) 
BB=BL 

DO  100  I=BB, EB 
X ( I ) =PMN 
100  CONTINUE 
GO  TO  30 
110  RETURN 
END 
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SUBROUTINE  SORT  ( V,  A, II, JJ) 

PUTS  INTO  A  THE  PERMUTATION  VECTOR  WHICH  SORTS  V  INTO 
INCREASING  ORDER.  ONLY  ELEMENTS  FROM  II  TO  JJ  ARE  CONSIDERED. 
ARRAYS  IU ( K )  AND  IL(K)  PERMIT  SORTING  UP  TO  2**(K+1)-1  ELEMENTS 

THIS  IS  A  MODIFICATION  OF  CACM  ALGORITHM  #347  BY  R.  C.  SINGLETON, 
WHICH  IS  A  MODIFIED  HOARE  QUICKSORT. 

DIMENSION  A(JJ),V(1),IU(20),IL(20) 

INTEGER  T, TT 

INTEGER  A 

REAL  V 

M=1 

I=II 

J=JJ 

10  IF  (I.GE.J)  GO  TO  80 
20  K=I 

IJ=(J+I )/2 
T=A(IJ) 

VT=V(IJ) 

IF  (V(I).LE.VT)  GO  TO  30 

A ( I J ) =A ( I ) 

A ( I ) =T 
T=A(IJ) 

V(IJ)=V(I) 

V ( I ) =VT 
VT=V ( I J ) 

30  L=*J 

IF  (V(J).GE.VT)  GO  TO  50 
A(IJ)=A(J) 

A(J)=T 
T=A( IJ ) 

V(IJ)=V(J) 

V  ( J )  =  VT 
VT=V(IJ) 

IF  (V(I).LE.VT)  GO  TO  50 
A  ( I J  )  =A  ( I  ) 

A(I)=T 
T=A( IJ ) 

V(IJ)=V(I) 

V(I)=VT 
VT=V ( I J ) 

GO  TO  50 
40  A(L)=A(K) 

A  ( K )  **TT 
V(L)»V(K) 

V  ( K )  =*VTT 

50  L«L-1 

IF  (V(L).GT.VT)  GO  TO  50 
TT=A(L) 

VTT»V(L) 

60  K-K+l 

IF  (V(K).LT.VT)  GO  TO  60 
IF  (K.LE.L)  GO  TO  40 
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IF  (L-I.LE.J-K)  GO  TO  70 
IL(M)=I 
IU(M)=L 
I=K 
M=M+1 
GO  TO  90 
70  IL(M)=K 
IU(M)=J 
J=L 
M=M+1 
GO  TO  90 
80  M=M-1 

IF  (M.EQ.O)  RETURN 
I=IL(M) 

J=IU(M) 

90  IF  (J-I.GT.10)  GO  TO  20 
IF  (I.EQ.II)  GO  TO  10 
1  =  1-1 

100  1=1+1 

IF  (I.EQ.J)  GO  TO  80 
T=A( 1+1 ) 

VT=V ( 1+1 ) 

IF  (V(I).LE.VT)  GO  TO  100 
K=I 

110  A(K+1)=A(K) 

V ( K+l ) =V ( K ) 

K=K-1 

IF  (VT.LT.V(K))  GO  TO  110 
A  ( K+ 1 )  =T 
V (K+l )=VT 
GO  TO  100 
END 
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